Wes, "Users would be well advised to not write columns with large numbers (> 1000) of columns" You've mentioned this before and as this is in my experience not an uncommon use-case can you maybe expand a bit on the following related questions. (use-cases include daily or minute data for a few 10's of thousands items like stocks or other financial instruments, IoT sensors, etc).
Parquet standard - Is the issue intrinsic to the Parquet format itself, do you think? The ability to read a subset of the columns and/or row groups, and the compact storage through RLE, dictionary-encoded categoricals, etc., all seem to point to the format being well suited to these use cases.

Parquet-C++ implementation - Is the issue with the current Parquet-C++ implementation, or with any of its dependencies? Is it something that could be fixed? Would a specialized implementation help? Is the problem related to going from Parquet -> Arrow -> Python/pandas, e.g. would a direct Parquet -> NumPy reader work better?

Alternatives - What would you recommend as a superior solution? Store this data tall instead of wide? Use another storage format?

Appreciate your (and others') insights.

Cheers, Maarten.
