akshatshenoi-eng commented on PR #56193:
URL: https://github.com/apache/spark/pull/56193#issuecomment-4597256255

   > @akshatshenoi-eng I think we should support this all in Text and JSON as 
well with sharing the same codebase. Also do you mind explaining how it's going 
to work? e.g., if partitioned table is all tar-gzed, would Spark recognize the 
structure? Or would you read all them in single dataframe?
   > 
   > In addition, how do we handle the physical partitions? Would we distribute 
them quite well?
   
   Text and JSON support are both planned. I made ArchiveReader and 
streamArchiveEntries format-agnostic, so adding them should be 
straightforward: both have stream-based parsers, same as CSV. Parquet, 
ORC, Avro, XML, and Excel are also planned. I'm still figuring out Parquet, 
since it can't be streamed the way CSV can. I started with CSV to validate 
the streaming design end-to-end before scaling to other formats.
   
   Spark recognizes the partition structure correctly. Partition discovery 
happens at the directory level, independent of file format. If the layout is:
   
     s3://bucket/dt=2024-01-01/data.tar.gz
     s3://bucket/dt=2024-01-02/data.tar.gz
   
   each archive becomes a PartitionedFile with its partition values already 
attached (dt=2024-01-01, etc.). When the archive is streamed, every row 
produced from its entries inherits those partition values automatically.
   
   Each archive is a single Spark partition because tar is a sequential stream 
(isSplitable returns false, so Spark can't carve it into byte-range splits). 
Distribution across executors scales with the number of archive files: 10 
archives → 10 tasks, which are scheduled across the cluster normally. The 
current limitation is that a single large archive isn't parallelized; 
handling that is on the roadmap.
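   
   A toy model of the split planning may make the trade-off concrete. This is 
a plain Python sketch under simplifying assumptions, not Spark's planner: the 
function plan_tasks, the 128 MiB split size, and the part-N file names are 
all made up for the example, and it ignores Spark's packing of small files 
into shared partitions.

```python
MIB = 1024 * 1024


def plan_tasks(files, splittable, max_split_bytes=128 * MIB):
    """Return (path, start, length) tasks for a list of (path, size) files."""
    tasks = []
    for path, size in files:
        if not splittable:
            # A sequential stream must be read from byte 0 by one task.
            tasks.append((path, 0, size))
            continue
        # Splittable files can be carved into independent byte ranges.
        for start in range(0, size, max_split_bytes):
            tasks.append((path, start, min(max_split_bytes, size - start)))
    return tasks


# Hypothetical input: ten archives of 300 MiB each.
archives = [(f"s3://bucket/part-{i}.tar.gz", 300 * MIB) for i in range(10)]
print(len(plan_tasks(archives, splittable=False)))  # one task per archive
print(len(plan_tasks(archives, splittable=True)))   # three ranges per file
```

   With splitting disabled, parallelism is capped at the number of archives, 
so many medium-sized archives distribute well while one very large archive 
runs as a single task.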
   
   Sorry if anything else is vague or not yet implemented. My intern 
project is enabling multi-file archive read support for tar, tar.gz, zip, and 
7z.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

