wjones127 commented on PR #7716: URL: https://github.com/apache/arrow-datafusion/pull/7716#issuecomment-1743357905
> btw I would love to know why you chose DataFusion and how you are using it -- among other things, it might make an excellent example usecase for

Lance is essentially a table format (like Delta Lake). Table formats blur the line between data format and database, so building one requires database components, such as an expression library. In addition, one of Lance's distinguishing features is support for secondary indexes (right now, just ANN indexes for approximate KNN search). In order to use these, we need query plans that scan both indexed data and yet-to-be-indexed data in parallel and combine the two results in a single query. We use DataFusion to do this.

The two things we like about DataFusion in particular are: (1) it's easy to extend with new query nodes, for operations like scanning indices and our `Take` operation (get additional columns by their known row locations), and (2) it's Arrow-native.

DataFusion being Arrow-native has meant it's been easy to integrate with PyArrow and the larger Python data ecosystem. For example, we have many APIs where users write Python functions that operate on RecordBatches, and these can operate directly on the data without having to do any conversion. (We are very heavy users of the C Data Interface.)
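As a rough illustration of the plan shape described above (scan the index, scan the yet-to-be-indexed rows, combine, then `Take`), here is a minimal pure-Python sketch. It is not Lance's or DataFusion's API: every function name is hypothetical, the "index" is a plain dict searched exactly as a stand-in for a real ANN index, and the two scans run sequentially rather than in parallel.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_scan(rows, query, k):
    """Brute-force KNN over {row_id: vector}; returns up to k (distance, row_id) pairs."""
    return heapq.nsmallest(k, ((l2(vec, query), rid) for rid, vec in rows.items()))


def index_scan(index, query, k):
    """Hypothetical index lookup. A real ANN index would be approximate;
    this stand-in simply searches the indexed rows exactly."""
    return flat_scan(index, query, k)


def knn(index, unindexed, query, k):
    """Combine candidates from the indexed and unindexed scans, keep the global top k."""
    candidates = index_scan(index, query, k) + flat_scan(unindexed, query, k)
    return heapq.nsmallest(k, candidates)


def take(table, row_ids, columns):
    """Fetch additional columns for rows whose positions are already known."""
    return [{c: table[c][rid] for c in columns} for rid in row_ids]


# Toy columnar table: rows 0-2 are covered by the index, row 3 is not yet indexed.
table = {
    "vector": [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.1, 0.0)],
    "label": ["a", "b", "c", "d"],
}
index = {rid: table["vector"][rid] for rid in (0, 1, 2)}
unindexed = {3: table["vector"][3]}

hits = knn(index, unindexed, query=(0.0, 0.0), k=2)
row_ids = [rid for _, rid in hits]
labels = take(table, row_ids, ["label"])
```

In this toy run the second-nearest row comes from the unindexed data, which is why both scans have to feed the same top-k step before `take` fetches the extra column.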
