[GitHub] [arrow-rs] tustvold commented on issue #1745: Lack of examples on parquet file write

GitBox Mon, 30 May 2022 03:43:05 -0700


tustvold commented on issue #1745:
URL: https://github.com/apache/arrow-rs/issues/1745#issuecomment-1140999321

> I think that having two or three examples increasing in complexity and
> involving optionality and some amount of nesting would be good.

Yes, if you're happy to contribute such documentation that would be amazing
:+1:

> Do you think that it would make sense to go through the Arrow API even if
> I'm only looking to write Parquet files?

I think this really depends on the source of your data and whether it can be
cheaply read into arrow. The selling point of arrow is as a columnar
interchange format: different systems can pass buffers around and process
them efficiently. Assuming you can cheaply convert your input data to arrow,
it should, in theory, be faster than going through the row APIs...

That being said, the arrow writer has not yet had nearly as much attention
paid to it as the reader side, so it will be slower than the row APIs in some
cases. I've created a high-level ticket, #1764, but I'm not sure when I'll
have time to get to it.

> The main gripe I have/had is around the whole Dremel logic that is hard to
> grasp

Bit of an understatement there :laughing:. FWIW, I've found this to be one of
the more useful guides:
https://akshays-blog.medium.com/wrapping-head-around-repetition-and-definition-levels-in-dremel-powering-bigquery-c1a33c9695da

My point still stands that, in theory, the promise of arrow is that someone
else will have handled this for you, but your mileage may vary.
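
To make the Dremel part a bit more concrete, here is a minimal sketch of how
definition and repetition levels fall out for a single column that is a
nullable list of nullable int32s (max definition level 3, max repetition
level 1). It uses only the standard library; the `Levels` struct and the
`shred` function are hypothetical names made up for illustration and are not
part of the parquet crate's API.

```rust
/// Levels and non-null values for one leaf column.
/// Hypothetical type for illustration only, not a parquet crate type.
#[derive(Debug, Default, PartialEq)]
struct Levels {
    def: Vec<i16>,
    rep: Vec<i16>,
    values: Vec<i32>,
}

/// Shred rows of a nullable list of nullable i32 into Dremel levels.
///
/// Definition levels for this shape:
///   0 = the list itself is null
///   1 = the list is present but empty
///   2 = the element slot exists but the element is null
///   3 = the element is present (and its value is stored)
/// Repetition levels: 0 starts a new row, 1 continues the current list.
fn shred(rows: &[Option<Vec<Option<i32>>>]) -> Levels {
    let mut out = Levels::default();
    for row in rows {
        match row {
            None => {
                out.def.push(0);
                out.rep.push(0);
            }
            Some(list) if list.is_empty() => {
                out.def.push(1);
                out.rep.push(0);
            }
            Some(list) => {
                for (i, item) in list.iter().enumerate() {
                    out.rep.push(if i == 0 { 0 } else { 1 });
                    match item {
                        Some(v) => {
                            out.def.push(3);
                            out.values.push(*v);
                        }
                        None => out.def.push(2),
                    }
                }
            }
        }
    }
    out
}

fn main() {
    let rows = vec![
        Some(vec![Some(1), None, Some(2)]),
        None,
        Some(vec![]),
        Some(vec![Some(3)]),
    ];
    println!("{:?}", shred(&rows));
}
```

Note that only the non-null leaf values are stored; nulls and empty lists
exist purely as level entries. Every extra layer of optionality or nesting
adds another level value to keep track of, which is where most of the
confusion tends to come from.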

> Well I originally had a very nested schema, involving maps, nullable
> lists, required lists with nullable elements, etc. I'm not yet fixed on a
> format since I want to measure performance for a set of usecases, so I'll
> experiment on the format.

My 2 cents is that even when tooling supports nested schemas, that support
often comes with unexpected caveats. For example, Presto/Trino has had bugs
in projection pushdown for nested schemas for years. If you can flatten your
schemas, I would strongly advise doing so; it will save you a lot of
headaches down the line :sweat_smile:

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
