[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538624#comment-17538624
 ] 

Jonathan Swenson commented on CALCITE-2040:
-------------------------------------------

[~julianhyde] it appears that ARROW-11135 has been fixed (probably in 
Arrow 7.0.0). I was able to upgrade arrow and arrow-gandiva to 8.0.0 
(the latest) and run the tests on a Mac.

Working off [Julian's 
branch|https://github.com/julianhyde/calcite/tree/2040-arrow], I merged in the 
final changes from the [original 
branch|https://github.com/apache/calcite/pull/2133] and resolved a few 
places where the two had diverged. In addition, I made the changes 
suggested by [~vladimirsitnikov] to implement the data generation as part of 
the test setup (using a @TempDir in JUnit).

Those changes can be found 
[here|https://github.com/apache/calcite/compare/main...jonathanswenson:swenson/arrow-upgrade?expand=1].

There are a few things that I ran into that probably need to be considered:
 * Filters don't appear to apply over the whole dataset; they only seem to 
apply to the first record batch. I had to [update the batch size to 
20|https://github.com/apache/calcite/compare/main...jonathanswenson:swenson/arrow-upgrade?expand=1#diff-857069b93915684ed8b3ffbfb26fe4aadc8a2eebd5ce9c956fc614757c77e48dR52]
 to accommodate all the test data in a single batch. It is relatively easy to 
write a test that confirms this (two of the tests will fail if you bump the 
batch size back down to 10). This feels like a blocker to me. 
 * I had to move to an Intel Mac in order to execute the tests 
properly – I got a linker error when trying to run them on an M1 Mac 
(Apple silicon). I don't know if that is a blocker for development – I also had 
another suite (the Cassandra test suite) failing inexplicably on the M1 Mac, 
which I'll probably dig into more and file an issue for.
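To illustrate the first bullet: here is a self-contained sketch (plain Java, no Arrow dependency; the class and method names are hypothetical, not Calcite or Arrow APIs) of why a filter that only consults the first record batch undercounts once the data spans more than one batch:

```java
import java.util.Arrays;
import java.util.List;

public class BatchFilterDemo {
  // Buggy behavior: filter scans only the first record batch.
  static long filterFirstBatchOnly(List<int[]> batches, int threshold) {
    return Arrays.stream(batches.get(0)).filter(v -> v > threshold).count();
  }

  // Correct behavior: filter scans every record batch.
  static long filterAllBatches(List<int[]> batches, int threshold) {
    return batches.stream()
        .flatMapToInt(Arrays::stream)
        .filter(v -> v > threshold)
        .count();
  }

  public static void main(String[] args) {
    // 20 rows split into two batches of 10, mirroring a batch size of 10.
    List<int[]> batches = List.of(
        new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
        new int[] {11, 12, 13, 14, 15, 16, 17, 18, 19, 20});
    // Filtering on "value > 5": the first-batch-only scan misses every
    // matching row in the second batch.
    System.out.println(filterFirstBatchOnly(batches, 5)); // 5
    System.out.println(filterAllBatches(batches, 5));     // 15
  }
}
```

Bumping the batch size to 20 makes the whole test dataset fit in the first batch, which is why the tests pass at that size but fail at 10.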

I'm not deep enough into the Arrow/Gandiva support yet to know what's 
happening in either case. 

> Create adapter for Apache Arrow
> -------------------------------
>
>                 Key: CALCITE-2040
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2040
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: arrow_data.py
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)