Not sure I have all the details straight, but it seems like this caching could be problematic for very large lists. Is there a way to design it so that a single list can span pages? Or does that not make sense, since a single record has to fit in a page?
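If I'm reading the proposal right, the buffering would look roughly like the sketch below. This is only illustrative: the class, constants, and flushPage hook are invented for the example and are not the actual parquet-mr ColumnWriterImpl internals.

// Hypothetical sketch of record-boundary-aware page flushing. All names
// here are made up for illustration; this is not the parquet-mr API.
import java.util.ArrayList;
import java.util.List;

class BufferingColumnWriter {
    // One buffered (value, repetition level, definition level) triple.
    private static final class BufferedValue {
        final long value;
        final int repetitionLevel;
        final int definitionLevel;
        BufferedValue(long value, int repetitionLevel, int definitionLevel) {
            this.value = value;
            this.repetitionLevel = repetitionLevel;
            this.definitionLevel = definitionLevel;
        }
    }

    private static final int PAGE_SIZE_THRESHOLD = 1024 * 1024; // assumed 1 MB pages
    private static final int BYTES_PER_VALUE = 8;               // repeated Long column

    private final List<BufferedValue> currentPage = new ArrayList<>();
    private final List<BufferedValue> pendingRecord = new ArrayList<>();
    private int pageBytes = 0;

    // Values arrive one at a time; repetitionLevel == 0 marks the start of
    // a new record, which is the only point where a page may safely end.
    void write(long value, int repetitionLevel, int definitionLevel) {
        if (repetitionLevel == 0) {
            // The previous record is now known to be complete: commit it to
            // the current page, flushing first if the page is already full.
            commitPendingRecord();
        }
        pendingRecord.add(new BufferedValue(value, repetitionLevel, definitionLevel));
    }

    private void commitPendingRecord() {
        if (pendingRecord.isEmpty()) {
            return;
        }
        int recordBytes = pendingRecord.size() * BYTES_PER_VALUE;
        // Flush only between records, never in the middle of one, so a
        // repeated value is never split across two pages.
        if (pageBytes + recordBytes > PAGE_SIZE_THRESHOLD && pageBytes > 0) {
            flushPage();
        }
        currentPage.addAll(pendingRecord);
        pageBytes += recordBytes;
        pendingRecord.clear();
    }

    void close() {
        commitPendingRecord();
        if (!currentPage.isEmpty()) {
            flushPage();
        }
    }

    private void flushPage() {
        // Placeholder: encode the repetition/definition levels and values,
        // then hand the finished page off to the page store.
        currentPage.clear();
        pageBytes = 0;
    }
}

Note that in this sketch a single record larger than the page threshold still lands in one oversized page, which is exactly the case I'm asking about above.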
On Mon, Jun 30, 2014 at 9:10 AM, Jason Altekruse <[email protected]> wrote:
> Hello Parquet devs,
>
> I have been working more on the Drill implementation of Parquet to bring us
> up to full read compatibility as well as to implement write support. We are
> using the RecordConsumer interface to write data into Parquet files, and it
> seems that we have hit a bug when writing repeated data.
>
> I am currently just doing a simple test with a repeated field at the root
> of the schema. I am writing data pulled from a JSON file, where each record
> contains one repeated Long column with seven items. The problem appears
> when we hit one of the page thresholds: the ColumnWriterImpl writes only
> the values from the list that fit in the current page, not the entire list,
> so the 'value' within that column is split across two pages. I took a look
> at the source, and it does not look like ColumnWriterImpl actually ensures
> that a list ends before cutting off the page. With the implementation of
> repetition levels, I believe the end of a list can only be detected by
> reading one value from the next list (when the repetition level returns to
> 0). It seems like the actual writes to the page data should be cached
> inside the writer until it is determined that the entire list of values
> will fit in the page.
>
> Is there something that I am missing? I did not find any open issues for
> this, but I will work on a patch to see if I can get it working for me.
>
> Jason Altekruse
> MapR - Software Engineer
> Apache Drill Team
