alamb commented on issue #13261:
URL: https://github.com/apache/datafusion/issues/13261#issuecomment-2460335893
> I agree with the assessment that the information must be coming from the
file reader itself.
I also agree with this assessment.
In general I am not sure a SQL-level solution will work well.
Some challenges:
- `ctx.read_parquet("foo.parquet")` may read the file in parallel,
interleaving the rows
- `ctx.read_parquet("<directory>")` can read more than one file, and the row
offsets / positions are per file
However, the DataFrame API you sketch out above seems reasonable and a
relatively small addition.
The other systems I know of that support Delete Vectors (e.g. Vertica)
basically have:
1. A special flag on the scan node (`ParquetExec` in DataFusion) that says
to emit positions (in addition to potentially adding filters, etc)
2. A special operator that knows how to take a stream of positions and
encode them in whatever delete vector format there is.
So in DataFusion this might look more like adding a method to
`TableProvider` like `TableProvider::delete_from` similar to
https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.insert_into
And then each table provider would implement whatever API it needs (which
would likely involve positions, as you describe).
This would allow DataFusion to handle the planning of DELETE
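To make the shape of this concrete, here is a minimal, self-contained Rust sketch. It is not the actual DataFusion API: `DeleteSupport`, `DeleteVector`, and `InMemoryTable` are hypothetical names invented for illustration. The idea it shows is the two-part design above: a scan emits `(file, row position)` pairs, and the provider's `delete_from` encodes them into a per-file delete vector, since positions are only meaningful within a single file.

```rust
use std::collections::HashMap;

/// Hypothetical delete vector: one bitmask of deleted row positions per file.
#[derive(Default, Debug)]
struct DeleteVector {
    deleted: Vec<bool>,
}

impl DeleteVector {
    /// Mark the row at `pos` as deleted, growing the bitmask as needed.
    fn mark(&mut self, pos: usize) {
        if pos >= self.deleted.len() {
            self.deleted.resize(pos + 1, false);
        }
        self.deleted[pos] = true;
    }

    fn is_deleted(&self, pos: usize) -> bool {
        self.deleted.get(pos).copied().unwrap_or(false)
    }
}

/// Hypothetical trait mirroring the shape of `TableProvider::insert_into`:
/// the planner hands the provider the positions a position-emitting scan
/// produced, and the provider encodes them in its own delete vector format.
trait DeleteSupport {
    fn delete_from(&mut self, positions: &[(String, usize)]);
}

#[derive(Default)]
struct InMemoryTable {
    /// One delete vector per file, because row offsets are per file.
    delete_vectors: HashMap<String, DeleteVector>,
}

impl DeleteSupport for InMemoryTable {
    fn delete_from(&mut self, positions: &[(String, usize)]) {
        for (file, pos) in positions {
            self.delete_vectors
                .entry(file.clone())
                .or_default()
                .mark(*pos);
        }
    }
}

fn main() {
    let mut table = InMemoryTable::default();
    // Positions as a scan node might emit them: (file, row offset in file).
    table.delete_from(&[
        ("a.parquet".to_string(), 2),
        ("a.parquet".to_string(), 5),
        ("b.parquet".to_string(), 0),
    ]);
    assert!(table.delete_vectors["a.parquet"].is_deleted(2));
    assert!(!table.delete_vectors["a.parquet"].is_deleted(3));
    assert!(table.delete_vectors["b.parquet"].is_deleted(0));
    println!("files with delete vectors: {}", table.delete_vectors.len());
}
```

In a real implementation the `positions` argument would more likely be an `ExecutionPlan` streaming record batches of positions, as `insert_into` takes, but the per-file encoding step would look much the same.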
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]