gobraves commented on issue #6983:
URL: https://github.com/apache/arrow-datafusion/issues/6983#issuecomment-1660801140
Hi @alamb, apologies for the delayed response. Following your tips, I executed the commands below in the CLI and also ran the code you provided to reproduce the issue. I noticed that the CLI was almost 8 times faster than running that code, which is consistent with my CPU core count.
Here are the commands I executed in the CLI:

```sql
create external table test stored as parquet location 'part-0.parquet';
create table t as select * from test;
explain create table t as select * from test;
```
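For reference, this is roughly the shape of the DataFrame-side reproduction I ran (a minimal sketch rather than the exact snippet from your comment; it assumes the same `part-0.parquet` file as above):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Read the same parquet file the CLI external table points at.
    let df = ctx
        .read_parquet("part-0.parquet", ParquetReadOptions::default())
        .await?;
    // The slow path: cache() collects the results into an in-memory table.
    let cached = df.cache().await?;
    println!("cached {} rows", cached.count().await?);
    Ok(())
}
```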
In the logical plan of the explain output, I observed `CreateMemoryTable` and `TableScan`. I therefore reviewed the datafusion-cli code that handles `CreateMemoryTable`, as well as the `.cache()` function, hoping to identify the difference. I noticed that `target_partitions` is indeed passed in both cases, but I'm unsure why it is not used in `.cache()`. From the commit mentioned in issue #6984, the problem seems to be resolved by repartitioning, so the difference appears to be that one implementation uses `Partitioning` while the other does not. However, when browsing the code myself, I couldn't find any relevant settings. If this is the case, could you give me a hint as to which part of the code performs this operation?
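To make sure I'm describing the difference correctly, this is the kind of explicit repartition I have in mind before collecting (a sketch only; whether `cache()` should do this internally is exactly what the #6984 fix seems to address):

```rust
use datafusion::error::Result;
use datafusion::logical_expr::Partitioning;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_parquet("part-0.parquet", ParquetReadOptions::default())
        .await?;
    // Fan the plan out to target_partitions before collecting, which is
    // my understanding of what the repartitioning fix does internally.
    let n = ctx.state().config().target_partitions();
    let df = df.repartition(Partitioning::RoundRobinBatch(n))?;
    let cached = df.cache().await?;
    println!("cached {} rows", cached.count().await?);
    Ok(())
}
```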
I have one more question: do we need to create a new DmlStatement to address this issue?
> Perhaps this could be done by creating a LogicalPlan::DmlStatement for write and then letting the existing insert machinery work rather than doing a custom "collect".
I'm not entirely clear on this statement, probably because I haven't fully grasped the problem described above.
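To check my reading of it, this is roughly what I think the suggestion means, sketched from my (possibly wrong) understanding of `datafusion_expr`; the field names, the `WriteOp::InsertInto` variant, and the pre-registered `cached` table are all my assumptions, not a working patch:

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::logical_expr::{DmlStatement, LogicalPlan, WriteOp};
use datafusion::prelude::*;

// Hypothetical shape of cache() built on the insert machinery instead of a
// custom collect(): wrap the DataFrame's plan in a Dml node that inserts into
// a previously registered MemTable, then execute that plan so the normal
// (partitioned) insert path does the work.
async fn cache_via_insert(ctx: &SessionContext, df: DataFrame) -> Result<()> {
    let input = df.logical_plan().clone();
    let table_schema = input.schema().clone();
    let plan = LogicalPlan::Dml(DmlStatement {
        table_name: "cached".into(), // assumes a MemTable named "cached" is already registered
        table_schema,
        op: WriteOp::InsertInto, // variant name is my assumption
        input: Arc::new(input),
    });
    DataFrame::new(ctx.state(), plan).collect().await?;
    Ok(())
}
```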