[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

Jonathan Swenson (Jira) Thu, 19 May 2022 20:43:04 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539905#comment-17539905
 ]


Jonathan Swenson commented on CALCITE-2040:
-------------------------------------------

[~julianhyde] what do you think is required here to get this over the line? 

The PR looks like a reasonable baseline, but many pieces are not yet 
implemented. 

For example: ORs, INs, IS (NOT) NULL, projection of dates, more complicated 
boolean logic (WHERE (x > 5) IS NOT FALSE), joins, unions, certain casts, 
certain filter literals. 
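
To illustrate why the IS NOT FALSE case needs more than a plain comparison: 
under SQL three-valued logic, x > 5 is UNKNOWN when x is NULL, and UNKNOWN IS 
NOT FALSE is TRUE, so rows with a NULL x pass the filter. A minimal Python 
sketch, using None for NULL/UNKNOWN; the helper names are hypothetical and are 
not part of Calcite or Gandiva:

```python
from typing import Optional


def greater_than(x: Optional[int], literal: int) -> Optional[bool]:
    """SQL-style `x > literal`: returns None (UNKNOWN) when x is NULL."""
    if x is None:
        return None
    return x > literal


def is_not_false(value: Optional[bool]) -> bool:
    """SQL `IS NOT FALSE`: TRUE for both TRUE and UNKNOWN, never UNKNOWN itself."""
    return value is not False


rows = [3, 7, None]

# WHERE x > 5 keeps only rows where the predicate is TRUE.
plain = [x for x in rows if greater_than(x, 5) is True]

# WHERE (x > 5) IS NOT FALSE also keeps rows where the predicate is UNKNOWN.
not_false = [x for x in rows if is_not_false(greater_than(x, 5))]
```

Here plain keeps only 7, while not_false keeps both 7 and the NULL row, so a 
translation that drops the IS NOT FALSE wrapper would change the result.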

I attempted to write some simple tests for many of the cases you called out, 
but many of them do not work. They fail in one of three places:

+ the Calcite calc layer, because data is not mapped as expected (dates come 
back as date_day integers, which get mapped to timestamps)

+ unimplemented parts of Gandiva (comparison of a smallint (or byte) with an 
integer)

+ the translation layer from Calcite to Gandiva expression trees, which has 
missing implementations (no mapping of ORs or null-checking clauses right now). 
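
As a rough illustration of the date issue in the first item: an Arrow date_day 
value is a count of days since 1970-01-01. The sketch below assumes (this is a 
guess at the failure mode, not something confirmed in the PR) that the integer 
ends up being read as milliseconds since the epoch; the function names are 
made up for the example.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_to_date(days: int) -> date:
    """Intended reading: the integer is a number of days since the epoch."""
    return EPOCH_DATE + timedelta(days=days)


def millis_to_timestamp(millis: int) -> datetime:
    """Assumed faulty reading: the same integer treated as epoch milliseconds."""
    return EPOCH_UTC + timedelta(milliseconds=millis)


raw = 19131  # 19131 days after 1970-01-01 is 2022-05-19

intended = days_to_date(raw)
misread = millis_to_timestamp(raw)  # lands a few seconds after the epoch
```

Under that assumption, every date in the column would collapse to a timestamp 
on 1970-01-01, which is the kind of result that looks plausible but is wrong.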

 

For unsupported functionality, the query fails in most cases, but in some 
cases the results appear to be incorrect (projection of dates) or are simply 
missing (joins).
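
One possible way to turn the silent wrong or missing results into explicit 
failures is for the translation step to reject any operator it has no mapping 
for. The sketch below is only an illustration of that idea: the Call class, the 
SUPPORTED_OPS set, and the output format are all hypothetical and do not 
reflect the actual Calcite or Gandiva APIs.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical set of operators the translator knows how to map.
SUPPORTED_OPS = {"=", "<", ">", "AND"}


class UnsupportedExpressionError(Exception):
    """Raised when an expression has no mapping in the target engine."""


@dataclass(frozen=True)
class Call:
    """A toy expression node: an operator applied to operands."""
    op: str
    operands: Tuple[object, ...]


def translate(expr: Union[Call, str, int]) -> str:
    """Translate a toy expression tree to a prefix string, failing loudly."""
    if isinstance(expr, Call):
        if expr.op not in SUPPORTED_OPS:
            raise UnsupportedExpressionError(
                f"no mapping for operator {expr.op!r}")
        args = ", ".join(translate(operand) for operand in expr.operands)
        return f"{expr.op}({args})"
    # Column names and literals are passed through as text.
    return str(expr)


supported = translate(Call(">", ("x", 5)))

try:
    translate(Call("OR", (Call(">", ("x", 5)), Call("<", ("x", 1)))))
    rejected = False
except UnsupportedExpressionError:
    rejected = True
```

With a check like this, an OR filter would surface as an error (or could be 
left for another layer to evaluate) rather than producing a partial answer.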

> Create adapter for Apache Arrow
> -------------------------------
>
>                 Key: CALCITE-2040
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2040
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: arrow_data.py
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would be good to have Arrow as a calling convention, that is, 
> implementations of relational operators such as Filter, Project, and 
> Aggregate in addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would fit better as a "contrib" module in the 
> Arrow project than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
