[
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538624#comment-17538624
]
Jonathan Swenson commented on CALCITE-2040:
-------------------------------------------
[~julianhyde] it appears as though ARROW-11135 has been fixed (probably in
arrow 7.0.0). I was able to upgrade arrow and arrow-gandiva to 8.0.0 (latest)
and run the tests on a Mac.
Working off [Julian's
branch|https://github.com/julianhyde/calcite/tree/2040-arrow], I merged in the
final changes from the [original
branch|https://github.com/apache/calcite/pull/2133] and resolved a few
differences in how the two had diverged. In addition, I made the changes
suggested by [~vladimirsitnikov] to implement the data generation as part of
the test setup (using a @TempDir in JUnit).
Those changes can be found
[here|https://github.com/apache/calcite/compare/main...jonathanswenson:swenson/arrow-upgrade?expand=1].
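For reference, the shape of that data-generation-in-setup change can be sketched like this. This is a Python analogue of the JUnit @TempDir pattern, with hypothetical names; the actual change is Java test code writing real Arrow files:

```python
# Toy sketch (not the actual Calcite test code): generate the Arrow test data
# inside test setup using a per-test temporary directory, analogous to
# JUnit 5's @TempDir. All names here are hypothetical.
import tempfile
import unittest
from pathlib import Path


class ArrowAdapterTest(unittest.TestCase):
    def setUp(self):
        # Each test gets a fresh directory that is cleaned up automatically,
        # instead of relying on pre-generated files checked into the repo.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.data_dir = Path(self._tmp.name)
        self.generate_test_data(self.data_dir / "test.arrow")

    def generate_test_data(self, path):
        # Stand-in for the real Arrow file writer (the Java version would use
        # Arrow's writer APIs here).
        path.write_text("placeholder")

    def test_data_exists(self):
        self.assertTrue((self.data_dir / "test.arrow").exists())
```

The benefit is the same as in the Java change: the data never goes stale relative to the generator, and nothing leaks between test runs.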
There are a few things that I ran into that probably need to be considered:
* Filters don't appear to apply over the whole dataset; they only seem to
apply to the first record batch. I had to [update the batch size to
20|https://github.com/apache/calcite/compare/main...jonathanswenson:swenson/arrow-upgrade?expand=1#diff-857069b93915684ed8b3ffbfb26fe4aadc8a2eebd5ce9c956fc614757c77e48dR52]
to accommodate this. It is relatively easy to write a test that confirms the
problem (two of the tests fail if you bump the batch size back down to 10).
This feels like a blocker to me.
* I had to move to an Intel Mac in order to run the tests
properly – I got a linker error when trying to run them on an M1 Mac
(Apple silicon). I don't know if that is a blocker for development – I had
another suite (the Cassandra test suite) failing inexplicably on the M1 Mac,
which I'll probably dig into more and file an issue for.
I'm not deep enough into the arrow / gandiva support yet to know what's
happening in either of these cases.
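To illustrate the first point, here is a toy model (plain Python, not the actual Gandiva/Calcite code) of the suspected failure mode: if the filter is only evaluated against the first record batch, rows in later batches are silently dropped, and a batch size of 20 masks the bug only because the whole 20-row test dataset then fits in one batch.

```python
# Toy model of "filter applies only to the first record batch".
# Record batches are modeled as plain lists of rows; this is an assumption
# about the failure mode, not the real Arrow/Gandiva implementation.

def to_batches(rows, batch_size):
    """Split a dataset into record batches of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def filter_first_batch_only(batches, predicate):
    # Suspected current behavior: only batch 0 is ever scanned.
    return [row for row in batches[0] if predicate(row)]

def filter_all_batches(batches, predicate):
    # Expected behavior: every batch is scanned.
    return [row for batch in batches for row in batch if predicate(row)]

rows = list(range(20))        # 20-row dataset, as in the tests
pred = lambda r: r >= 15      # matches only rows in the second 10-row batch

# Batch size 10: the buggy path misses every match in batch 1.
assert filter_first_batch_only(to_batches(rows, 10), pred) == []
# Batch size 20: all rows land in batch 0, so the bug is masked.
assert filter_first_batch_only(to_batches(rows, 20), pred) == [15, 16, 17, 18, 19]
# Correct behavior is independent of batch size.
assert filter_all_batches(to_batches(rows, 10), pred) == [15, 16, 17, 18, 19]
```

This is why bumping the batch size to 20 makes the tests pass while 10 makes two of them fail.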
> Create adapter for Apache Arrow
> -------------------------------
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Assignee: Julian Hyde
> Priority: Major
> Labels: pull-request-available
> Attachments: arrow_data.py
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading,
> say, CSV files using the file adapter: an Arrow data set does not have a URL.
> (Unless we use Arrow's
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
> format, or use an in-memory file system such as Alluxio.) So we would need
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it
> would also be good to have Arrow as a calling convention. That is,
> implementations of relational operators such as Filter, Project, Aggregate in
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in
> CALCITE-2025) it would make a lot of sense to translate those formats
> directly into Arrow (applying simple projects and filters first if
> applicable). Those adapters would belong better as a "contrib" module in
> the Arrow project than in Calcite.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)