Hi Ryan,

Can you tell me when the int64 delta encoding will be available as part of a release, and whether Drill will need an updated Parquet version to read it?
Regards,
-Stefan

On Thu, Feb 4, 2016 at 11:25 PM, Stefán Baxter <[email protected]> wrote:

> great, and yes, I'm using the settings you provided me with :)
>
>     .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>
> On Thu, Feb 4, 2016 at 11:24 PM, Ryan Blue <[email protected]> wrote:
>
>> Delta int64 encoding isn't released yet. We have a PR that I'm on the
>> hook for getting in. :)
>>
>> Also, it's one of the 2.0 format encodings, so you'll need that option
>> turned on.
>>
>> rb
>>
>> On 02/04/2016 03:21 PM, Stefán Baxter wrote:
>>
>>> thnx.
>>>
>>> This is a timestamp field from a smaller sample using the new settings:
>>>
>>> Feb 4, 2016 11:06:43 PM INFO:
>>> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
>>> [occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
>>> encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
>>> 7,058B comp}
>>>
>>> Any reason that comes to mind why this is not integer delta? (The time
>>> between these entries is often only a few seconds.)
>>>
>>> -Stefan
>>>
>>> On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <[email protected]> wrote:
>>>
>>>> You should be getting the underlying data back instead of Timestamp
>>>> objects. You can pull in Avro 1.8.0 and use the conversions yourself
>>>> rather than waiting for them to be included in the library.
>>>>
>>>> rb
>>>>
>>>> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>>>>
>>>>> I'm not looking to turn it off, absolutely not; I'm looking to use it
>>>>> in the most effective way :)
>>>>>
>>>>> Is there something I can do right now to force these fields to be
>>>>> timestamp fields in Parquet?
>>>>>
>>>>> Regards,
>>>>> -Stefan
>>>>>
>>>>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> Got it. You can also turn off dictionary encoding with an option on
>>>>>> the builder.
>>>>>>
>>>>>> For timestamp, the support was just released in Avro 1.8.0 and
>>>>>> there's a pending pull request for adding the same logical types API
>>>>>> to parquet-avro:
>>>>>> https://github.com/apache/parquet-mr/pull/318
>>>>>>
>>>>>> Once that's merged, you'll just have to add conversions to your data
>>>>>> model like this:
>>>>>>
>>>>>>     GenericData model = new GenericData();
>>>>>>     model.addLogicalTypeConversion(
>>>>>>         new TimeConversions.TimestampConversion());
>>>>>>
>>>>>> Then pass that model into the builder.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> Thank you for taking the time.
>>>>>>>
>>>>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>>>>>>> on the optional dictionary encoding it's used for almost
>>>>>>> anything/everything. I even have some timestamp fields that are
>>>>>>> turned into dictionaries. (I would have guessed delta integer.)
>>>>>>>
>>>>>>> I have ~5M entries in my test file; the dictionary-based file ends
>>>>>>> up 550 MB and the non-dictionary-based one ends up 790 MB (still
>>>>>>> faster).
>>>>>>>
>>>>>>> So I started wondering if I could affect these decisions to compare
>>>>>>> size, speed, etc. I understand the rationale behind automatic
>>>>>>> selection; it just seemed somewhat naive in the Drill scenario.
>>>>>>>
>>>>>>> Another matter... can you point me to an example that shows how to
>>>>>>> deal with Avro having no timestamp fields when converting to
>>>>>>> Parquet?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> -Stefán
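For reference, here is a minimal, self-contained sketch of how the conversion Ryan describes might be wired up end to end. It is an illustration under assumptions, not thread-confirmed code: it assumes Avro 1.8.0 on the classpath and that the pending parquet-avro pull request linked above has been merged. The record name Event and the output path are hypothetical; occurred_at simply mirrors the field from the log output above:

    import org.apache.avro.LogicalTypes;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.data.TimeConversions;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class TimestampConversionSketch {
      public static void main(String[] args) throws Exception {
        // A long field carrying the timestamp-millis logical type.
        Schema tsType = LogicalTypes.timestampMillis()
            .addToSchema(Schema.create(Schema.Type.LONG));
        Schema schema = SchemaBuilder.record("Event").fields()  // name is illustrative
            .name("occurred_at").type(tsType).noDefault()
            .endRecord();

        // Register the conversion on the data model, as in Ryan's snippet.
        GenericData model = new GenericData();
        model.addLogicalTypeConversion(
            new TimeConversions.TimestampConversion());

        // Pass that model into the writer builder.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(
                         new Path("/tmp/events.parquet"))  // placeholder path
                     .withSchema(schema)
                     .withDataModel(model)
                     .build()) {
          // writer.write(record) calls would now accept records whose
          // occurred_at values are Joda-Time DateTime objects, converting
          // them to the underlying int64 millis on write.
        }
      }
    }

In Avro 1.8.0 the conversions are Joda-Time based; without the conversion registered, the field would have to be populated with plain long values instead.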
>>>>>>>
>>>>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Stefán,
>>>>>>>>
>>>>>>>> The schema converter will map Avro types to their Parquet
>>>>>>>> equivalents, for which there really aren't choices or options. The
>>>>>>>> mapping is straightforward, like long to int64.
>>>>>>>>
>>>>>>>> For the individual column encodings, Parquet chooses those
>>>>>>>> automatically based on the column type and data. For example,
>>>>>>>> dictionary encoding is used if it gets better results than plain
>>>>>>>> encoding, and integer columns always use the bit-packing and
>>>>>>>> run-length encoding hybrid. There aren't many choices you would
>>>>>>>> make on a per-column basis here, either.
>>>>>>>>
>>>>>>>> There are two options you can control that affect encodings: the
>>>>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>>>>> encodings are delta binary and delta integer, which can't be read
>>>>>>>> by some older readers or by Apache Impala. They get great
>>>>>>>> compression on certain types, though. You can also control the
>>>>>>>> maximum dictionary size, which could help if you have columns that
>>>>>>>> should be dictionary-encoded but are falling back to plain encoding
>>>>>>>> because the dictionary gets too big.
>>>>>>>>
>>>>>>>> Both of those options are exposed by the builder when you create a
>>>>>>>> writer:
>>>>>>>>
>>>>>>>>     AvroParquetWriter.builder(outputPath)
>>>>>>>>         .withSchema(schema)
>>>>>>>>         .withDataModel(ReflectData.get())
>>>>>>>>         .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>>>>         .withDictionaryPageSize(2*1024*1024)
>>>>>>>>         .build();
>>>>>>>>
>>>>>>>> The default dictionary page size is 1MB.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm using parquet-mr/parquet-avro to write Parquet files.
>>>>>>>>>
>>>>>>>>> I want to control/override the encoding type for a column and I
>>>>>>>>> find no documentation or examples regarding that.
>>>>>>>>>
>>>>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and
>>>>>>>>> I wonder how I can either set or hint columns to use a particular
>>>>>>>>> encoding option. Is that possible?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> -Stefán
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Cloudera, Inc.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Cloudera, Inc.
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
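Pulling the options from this thread together, a complete writer setup might look like the sketch below. The path and schema parameters are placeholders, and withDictionaryEncoding(false) is the standard ParquetWriter builder option that should correspond to the dictionary toggle Ryan mentions; treat it as a hedged sketch rather than confirmed thread code:

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriterOptionsSketch {
      // Opens a writer with the 2.0 encodings and a larger dictionary ceiling.
      public static ParquetWriter<GenericRecord> open(Path path, Schema schema)
          throws IOException {
        return AvroParquetWriter.<GenericRecord>builder(path)
            .withSchema(schema)
            .withDataModel(GenericData.get())
            // Opt in to the 2.0 format encodings (the delta encodings).
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            // Double the 1MB default before columns fall back to plain.
            .withDictionaryPageSize(2 * 1024 * 1024)
            // Set to false to benchmark file size without dictionaries.
            .withDictionaryEncoding(true)
            .build();
      }
    }

Keep Ryan's caveat in mind: files written with the PARQUET_2_0 encodings can't be read by Apache Impala or some older readers, so this is only safe when you control the reading side.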
