devinjdangelo commented on PR #7743: URL: https://github.com/apache/arrow-datafusion/pull/7743#issuecomment-1751409949
> I found a few other issues, but I don't think they are caused by this PR > > ```shell > $ mkdir /tmp/output > $ datafusion-cli > DataFusion CLI v31.0.0 > ❯ create external table output(time timestamp) stored as parquet location '/tmp/output' with order (time); > 0 rows in set. Query took 0.003 seconds. > > ❯ insert into output values (now()), (now() - interval '1 minute'), (now() + interval '2 minutes'), (now() - interval '3 minutes'); > +-------+ > | count | > +-------+ > | 4 | > +-------+ > 1 row in set. Query took 0.144 seconds. > > ❯ select * from output; > +----------------------------+ > | time | > +----------------------------+ > | 2023-10-06T20:31:08.042535 | > | 2023-10-06T20:33:08.042535 | > | 2023-10-06T20:34:08.042535 | > | 2023-10-06T20:36:08.042535 | > +----------------------------+ > 4 rows in set. Query took 0.005 seconds. > ``` > > However, if I insert the same data again, now the data is not sorted! > > ``` > ❯ insert into output values (now()), (now() - interval '1 minute'), (now() + interval '2 minutes'), (now() - interval '3 minutes'); > +-------+ > | count | > +-------+ > | 4 | > +-------+ > 1 row in set. Query took 0.178 seconds. > > ❯ select * from output; > +----------------------------+ > | time | > +----------------------------+ > | 2023-10-06T20:32:38.514236 | > | 2023-10-06T20:34:38.514236 | > | 2023-10-06T20:35:38.514236 | > | 2023-10-06T20:37:38.514236 | > | 2023-10-06T20:31:08.042535 | > | 2023-10-06T20:33:08.042535 | > | 2023-10-06T20:34:08.042535 | > | 2023-10-06T20:36:08.042535 | > +----------------------------+ > 8 rows in set. Query took 0.005 seconds. > ``` I actually think the above is correct behavior. The table is not globally sorted, but rather each individual file is sorted. Each time you insert, at least one new file is inserted. In the above result we see two independently sorted chunks, which means each insert created one new sorted file. > > I also found that there were a huge number of empty output files created > > ``` > alamb@MacBook-Pro-8:~/Software/arrow-datafusion2/datafusion-cli$ ls -ltr /tmp/output > total 256 > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_1.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_4.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_8.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_10.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_2.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_7.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_13.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_9.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_3.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_12.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_6.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_14.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_15.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_11.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:34 1PHmXyyoDVGbi7oo_5.parquet > -rw-r--r--@ 1 alamb wheel 615B Oct 6 16:34 1PHmXyyoDVGbi7oo_0.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_2.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_5.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_6.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_4.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_12.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_11.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_3.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_9.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_8.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_14.parquet > -rw-r--r--@ 1 alamb wheel 615B Oct 6 16:35 FtsEcvDwXi7JVaVq_0.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_7.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_15.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_10.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_13.parquet > -rw-r--r--@ 1 alamb wheel 289B Oct 6 16:35 FtsEcvDwXi7JVaVq_1.parquet > ``` > > But I don't think this is caused by this PR - #5383 Yeah, `FileSinks` currently output 1 file for each input stream. That number is determined by the optimizer. Based on your result, I would guess that your system has 16 vcores, so you end up with 1 stream containing data and 15 empty streams. We could solve this by making FileSinks a little more intelligent in how they partition the output. For example, they could be configured with a target file size. This idea is related to what will be required for #7744 . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
