wjones127 commented on PR #7716:
URL: 
https://github.com/apache/arrow-datafusion/pull/7716#issuecomment-1743357905

   > btw I would love to know why you chose DataFusion and how you are using it 
-- among other things, it might make an excellent example usecase for
   
   Lance is essentially a table format (like Delta Lake). Table formats blur 
the line between data format and database, so building one requires database 
components, such as an expression library. In addition, one of Lance's 
distinguishing features is support for secondary indexes (right now, just ANN 
indexes for approximate KNN search). To use these, we need query plans that 
scan both indexed data and yet-to-be-indexed data in parallel and combine the 
two results in a single query. We use DataFusion to do this.
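
   As a rough illustration of the "scan both and combine" idea above, here is 
a minimal pure-Python sketch. It is not Lance's or DataFusion's actual code: 
the function names (`scan_indexed`, `scan_unindexed`, `knn`) and the 
dict-of-vectors layout are assumptions made for this example, and the 
"index" is an exact search standing in for a real ANN index.

```python
import heapq
import math


def scan_indexed(index, query, k):
    """Stand-in for an ANN index lookup (hypothetical; exact search here).

    Returns up to k (distance, row_id) pairs from the indexed rows.
    """
    return heapq.nsmallest(
        k, ((math.dist(vec, query), rid) for rid, vec in index.items())
    )


def scan_unindexed(rows, query, k):
    """Brute-force scan over rows that the index does not cover yet."""
    return heapq.nsmallest(
        k, ((math.dist(vec, query), rid) for rid, vec in rows.items())
    )


def knn(index, unindexed, query, k):
    """Combine both scans and keep the k nearest row ids overall."""
    candidates = scan_indexed(index, query, k) + scan_unindexed(unindexed, query, k)
    return [rid for _, rid in heapq.nsmallest(k, candidates)]


indexed_rows = {0: (0.0, 0.0), 1: (5.0, 5.0)}
new_rows = {2: (1.0, 0.0), 3: (9.0, 9.0)}  # written after the index was built
nearest = knn(indexed_rows, new_rows, (0.0, 0.0), k=2)
```

   Each branch returns its own top-k, so the final merge only has to look at 
2k candidates. In a real plan the two scans would run as separate nodes in 
parallel; here they run one after the other for simplicity.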
   
   The two things we like about DataFusion in particular are: (1) it's easy to 
extend with new query nodes, for operations like scanning indices and our 
`Take` operation (get additional columns by their known row locations), and 
(2) it's Arrow-native. DataFusion being Arrow-native has meant it's been easy 
to integrate with PyArrow and the larger Python data ecosystem. For example, we 
have many APIs where users write Python functions that operate on 
RecordBatches, and these can operate directly on the data without having to do 
any conversion. (We are very heavy users of the C Data Interface.)
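
   To make the `Take` description concrete, here is a minimal sketch of the 
idea in plain Python. It assumes a toy columnar layout (a dict mapping column 
names to lists) and a hypothetical `take` helper; it is not Lance's 
implementation, which works on Arrow data.

```python
def take(columns, row_ids, names):
    """Fetch the requested columns for rows whose positions are already known.

    `columns` maps column name -> list of values (toy columnar layout).
    Output rows follow the order of `row_ids`.
    """
    return {name: [columns[name][i] for i in row_ids] for name in names}


table = {
    "id": [10, 11, 12, 13],
    "text": ["a", "b", "c", "d"],
}
# Row positions as they might come back from an index search (assumed values).
fetched = take(table, row_ids=[3, 1], names=["text"])
```

   The point of the operation is that only the requested columns, and only 
the matching rows, are read after the search has identified row locations.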

