tustvold commented on issue #1745:
URL: https://github.com/apache/arrow-rs/issues/1745#issuecomment-1140999321

   > I think that having two or three examples increasing in complexity and 
involving optionality and some amount of nesting would be good.
   
   Yes, if you're happy to contribute such documentation that would be amazing 
:+1: 
   
   > Do you think that it would make sense to go through the Arrow API even if 
I'm only looking to write Parquet files?
   
   I think this really depends on the source of your data, and whether it 
can be cheaply read into Arrow. The selling point of Arrow is as a columnar 
interchange format: different systems can pass buffers around in a layout 
that they can process efficiently. Assuming you can cheaply convert your input 
data to Arrow, it should be faster than the row APIs...
   
   That being said, the Arrow writer has so far received far less attention 
than the reader side, and so will be slower than the row APIs in some cases. 
I've created a high-level ticket, #1764, but I'm not sure when I'll have time 
to get to it. 
   
   > The main gripe I have/had is around the whole Dremel logic that is hard to 
grasp
   
   Bit of an understatement here :laughing:. FWIW, I've found this to be one of 
the more useful guides: 
https://akshays-blog.medium.com/wrapping-head-around-repetition-and-definition-levels-in-dremel-powering-bigquery-c1a33c9695da
   
   My point still stands: in theory, the promise of Arrow is that someone else 
will have handled this for you, but your mileage may vary.
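   
   To make the levels a little more concrete, here is a minimal sketch in plain 
Rust (standard library only, not the parquet crate's API). It computes 
definition and repetition levels for a single nullable list column with 
nullable `i32` elements, assuming the standard three-level Parquet list layout 
(max definition level 3, max repetition level 1). The names `Levels` and 
`shred` are hypothetical and exist only for illustration.

```rust
/// Shredded form of one leaf column (hypothetical type for illustration).
#[derive(Debug, Default, PartialEq)]
struct Levels {
    def: Vec<u8>,     // definition level per slot
    rep: Vec<u8>,     // repetition level per slot
    values: Vec<i32>, // only the non-null leaf values
}

/// Assumed schema: optional list -> repeated group -> optional int32.
/// def 0 = list is null, 1 = list is empty,
/// def 2 = element is null, 3 = element is present.
/// rep 0 = first slot of a new row, 1 = continuation of the same list.
fn shred(rows: &[Option<Vec<Option<i32>>>]) -> Levels {
    let mut out = Levels::default();
    for row in rows {
        match row {
            None => {
                out.def.push(0);
                out.rep.push(0);
            }
            Some(items) if items.is_empty() => {
                out.def.push(1);
                out.rep.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    out.rep.push(if i == 0 { 0 } else { 1 });
                    match item {
                        Some(v) => {
                            out.def.push(3);
                            out.values.push(*v);
                        }
                        None => out.def.push(2),
                    }
                }
            }
        }
    }
    out
}

fn main() {
    let rows: Vec<Option<Vec<Option<i32>>>> = vec![
        Some(vec![Some(1), None, Some(2)]),
        None,
        Some(vec![]),
        Some(vec![Some(3)]),
    ];
    let levels = shred(&rows);
    // def:    [3, 2, 3, 0, 1, 3]
    // rep:    [0, 1, 1, 0, 0, 0]
    // values: [1, 2, 3]
    println!("{:?}", levels);
}
```

   Even this single-level case needs four distinct definition levels, and each 
additional layer of nesting or optionality adds more, which is where most of 
the confusion tends to come from.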
   
   > Well I originally had a very nested schema, involving maps, nullable 
lists, required lists with nullable elements, etc. I'm not yet fixed on a 
format since I want to measure performance for a set of use cases, so I'll 
experiment on the format.
   
   My 2 cents is that even when tooling supports nested schemas, it often comes 
with unexpected caveats. For example, Presto/Trino has had bugs in projection 
pushdown for nested schemas for years. If you can flatten your schemas, I would 
strongly advise doing so: you will save yourself a lot of headaches down the 
line :sweat_smile: 
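   
   As a rough illustration of what flattening can look like, here is a sketch 
in plain Rust. The `Order`, `Customer`, `Address` and `FlatRow` types are made 
up for this example and are not taken from the schema discussed in this issue. 
Nested structs become prefixed columns, and the list is exploded into one row 
per element.

```rust
// Hypothetical nested record types, for illustration only.
struct Address {
    city: String,
}

struct Customer {
    name: String,
    address: Option<Address>,
}

struct Order {
    id: u64,
    customer: Customer,
    skus: Vec<String>,
}

/// Flat equivalent: every field is a top-level, at most nullable, column.
#[derive(Debug, PartialEq)]
struct FlatRow {
    order_id: u64,
    customer_name: String,
    customer_city: Option<String>,
    sku: String,
}

/// Emits one flat row per (order, sku) pair.
/// Note: an order with no skus produces no rows in this sketch.
fn flatten(orders: &[Order]) -> Vec<FlatRow> {
    orders
        .iter()
        .flat_map(|o| {
            o.skus.iter().map(move |sku| FlatRow {
                order_id: o.id,
                customer_name: o.customer.name.clone(),
                customer_city: o.customer.address.as_ref().map(|a| a.city.clone()),
                sku: sku.clone(),
            })
        })
        .collect()
}

fn main() {
    let orders = vec![Order {
        id: 1,
        customer: Customer {
            name: "ada".to_string(),
            address: Some(Address { city: "London".to_string() }),
        },
        skus: vec!["a-1".to_string(), "b-2".to_string()],
    }];
    for row in flatten(&orders) {
        println!("{:?}", row);
    }
}
```

   The trade-off is duplicated parent fields and the loss of empty lists, but 
every column ends up with a repetition level of 0 and a definition level of at 
most 1.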


