akshatshenoi-eng commented on PR #56193:
URL: https://github.com/apache/spark/pull/56193#issuecomment-4597256255
> @akshatshenoi-eng I think we should support this all in Text and JSON as
well with sharing the same codebase. Also do you mind explaining how it's going
to work? e.g., if partitioned table is all tar-gzed, would Spark recognize the
structure? Or would you read all them in single dataframe?
>
> In addition, how do we handle the physical partitions? Would we distribute
them quite well?
Text and JSON support are both planned. I made ArchiveReader and
streamArchiveEntries format-agnostic, so adding that support should be
straightforward: both have stream-based parsers (same as CSV). Parquet,
ORC, Avro, XML, and Excel are also planned. I'm still figuring out Parquet,
since it can't be streamed the way CSV can. I just wanted to start with CSV to
validate the streaming design end-to-end before scaling to other formats.
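
To illustrate the format-agnostic streaming idea, here is a minimal sketch in plain Python. It is not the PR's code: `stream_archive_entries`, `read_csv_rows`, and `build_sample_archive` are hypothetical names used only for illustration. The entry iterator knows nothing about CSV, and the parser only sees a byte stream.

```python
import csv
import io
import tarfile


def stream_archive_entries(fileobj):
    """Yield (entry_name, binary_stream) for each regular file in a tar.gz."""
    # Mode "r|gz" reads the archive strictly front to back, without seeking.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                # The stream must be consumed before moving to the next entry.
                yield member.name, archive.extractfile(member)


def read_csv_rows(fileobj):
    """Parse every entry as CSV; swapping the parser is the only change
    needed for another line-oriented format (hypothetical design)."""
    for name, stream in stream_archive_entries(fileobj):
        lines = (raw.decode("utf-8") for raw in stream)
        for row in csv.reader(lines):
            yield name, row


def build_sample_archive():
    """Build a small in-memory tar.gz with two CSV entries (test data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for name, body in [("a.csv", b"id,v\n1,x\n"), ("b.csv", b"id,v\n2,y\n")]:
            info = tarfile.TarInfo(name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    buf.seek(0)
    return buf
```

Calling `list(read_csv_rows(build_sample_archive()))` walks both entries in archive order, one row at a time.
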
Spark recognizes the partition structure correctly. Partition discovery
happens at the directory level, independent of file format. If the layout is:
s3://bucket/dt=2024-01-01/data.tar.gz
s3://bucket/dt=2024-01-02/data.tar.gz
each archive becomes a PartitionedFile with its partition values already
attached (dt=2024-01-01, etc.). When the archive is streamed, every row
produced from its entries inherits those partition values automatically.
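
As a rough illustration of directory-level discovery, the sketch below derives partition values from `key=value` path segments and attaches them to rows. It is a simplified model in plain Python, not Spark's actual discovery code; `partition_values` and `attach_partition_values` are hypothetical helpers.

```python
from urllib.parse import urlparse


def partition_values(path):
    """Collect key=value directory segments from a file path."""
    # Drop the last segment (the file name); only directories carry partitions.
    segments = urlparse(path).path.strip("/").split("/")[:-1]
    values = {}
    for segment in segments:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def attach_partition_values(rows, values):
    """Every row read from the archive inherits the same partition values."""
    for row in rows:
        yield {**row, **values}
```

For the layout above, `partition_values("s3://bucket/dt=2024-01-01/data.tar.gz")` yields `{"dt": "2024-01-01"}`, independent of what the archive contains.
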
Each archive is a single Spark partition because tar is a sequential stream
(isSplitable returns false, so Spark can't carve it into byte-range splits).
The distribution across executors scales with the number of archive files: 10
archives → 10 tasks, which distribute across the cluster normally. The current
limitation is that a single large archive isn't parallelized, but handling that
is also on the roadmap.
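
The effect of a non-splittable format on task count can be sketched as follows. This is a deliberately simplified model with hypothetical names (`plan_splits`, `max_split_bytes`); it ignores any packing of several small files into one partition that a real planner may do.

```python
def plan_splits(file_sizes, splitable, max_split_bytes=128 * 1024 * 1024):
    """Return (file_name, start, length) splits for a set of files."""
    splits = []
    for name, size in file_sizes.items():
        if not splitable:
            # A sequential stream such as tar must be read whole by one task.
            splits.append((name, 0, size))
            continue
        # A splittable file can be carved into byte ranges.
        for start in range(0, size, max_split_bytes):
            splits.append((name, start, min(max_split_bytes, size - start)))
    return splits


# Ten hypothetical 300 MiB archives, one per partition directory.
archives = {f"dt=2024-01-{d:02d}/data.tar.gz": 300 * 1024 * 1024
            for d in range(1, 11)}
```

With `splitable=False` the ten archives produce ten splits, one per file, so parallelism is bounded by the number of archives rather than by their size.
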
Sorry for anything else that may be vague or not yet implemented. My intern
project is enabling multi-file archive read support for tar, tar.gz, zip, and 7z.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]