Great, and yes, I'm using the settings you provided me with :)

    .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
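In full, that's essentially the builder from your earlier mail; with imports and a throwaway schema so it stands on its own, it looks roughly like this (the output path and the occurred_at record schema are just placeholders, so treat it as an untested sketch):

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.reflect.ReflectData;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriterSetup {
      public static void main(String[] args) throws IOException {
        // Placeholder output path and a minimal schema with the occurred_at long field.
        Path outputPath = new Path("/tmp/events.parquet");
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
            + "[{\"name\":\"occurred_at\",\"type\":\"long\"}]}");

        // 2.0 encodings turned on, 2MB dictionary pages (the default is 1MB).
        ParquetWriter<Object> writer = AvroParquetWriter.builder(outputPath)
            .withSchema(schema)
            .withDataModel(ReflectData.get())
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .withDictionaryPageSize(2 * 1024 * 1024)
            .build();

        writer.close();
      }
    }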
On Thu, Feb 4, 2016 at 11:24 PM, Ryan Blue <[email protected]> wrote:

> Delta int64 encoding isn't released yet. We have a PR that I'm on the hook
> for getting in. :)
>
> Also, it's one of the 2.0 format encodings, so you'll need that option
> turned on.
>
> rb
>
> On 02/04/2016 03:21 PM, Stefán Baxter wrote:
>
>> thnx.
>>
>> This is a time-stamp field from a smaller sample using the new settings:
>>
>> Feb 4, 2016 11:06:43 PM INFO:
>> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
>> [occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
>> encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
>> 7,058B comp}
>>
>> Any reason that comes to mind why this is not an integer delta? (The time
>> between these entries is often a few seconds.)
>>
>> -Stefan
>>
>> On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <[email protected]> wrote:
>>
>>> You should be getting the underlying data back instead of Timestamp
>>> objects. You can pull in Avro 1.8.0 and use the conversions yourself
>>> rather than waiting for them to be included in the library.
>>>
>>> rb
>>>
>>> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>>>
>>>> I'm not looking to turn it off, absolutely not, I'm looking to use it in
>>>> the most effective way :)
>>>>
>>>> Is there something I can do right now to force these fields to be
>>>> timestamp fields in Parquet?
>>>>
>>>> Regards,
>>>> -Stefan
>>>>
>>>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <[email protected]> wrote:
>>>>
>>>>> Got it. You can also turn off dictionary encoding with an option on the
>>>>> builder.
>>>>>
>>>>> For timestamp, the support was just released in Avro 1.8.0 and there's a
>>>>> pending pull request for adding the same logical types API to
>>>>> parquet-avro: https://github.com/apache/parquet-mr/pull/318
>>>>>
>>>>> Once that's merged, you'll just have to add conversions to your data
>>>>> model like this:
>>>>>
>>>>>   GenericData model = new GenericData();
>>>>>   model.addLogicalTypeConversion(
>>>>>       new TimeConversions.TimestampConversion());
>>>>>
>>>>> Then pass that model into the builder.
>>>>>
>>>>> rb
>>>>>
>>>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> Thank you for taking the time.
>>>>>>
>>>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn on
>>>>>> the optional dictionary encoding it's used for almost
>>>>>> anything/everything. I even have some time-stamp fields that are turned
>>>>>> into a dictionary. (I would have guessed delta integer.)
>>>>>>
>>>>>> I have ~5M entries in my test file; the dictionary-based file ends up at
>>>>>> 550MB and the non-dictionary-based one at 790MB (still faster).
>>>>>>
>>>>>> So I started wondering if I could affect these decisions to compare
>>>>>> size, speed, etc. I understand the rationale behind automatic selection;
>>>>>> it just seemed somewhat naive in that Drill scenario.
>>>>>>
>>>>>> Another matter... can you point me to an example that shows how to deal
>>>>>> with Avro having no timestamp fields and the conversion to Parquet?
>>>>>>
>>>>>> Best regards,
>>>>>> -Stefán
>>>>>>
>>>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Stefán,
>>>>>>>
>>>>>>> The schema converter will map Avro types to their Parquet equivalents,
>>>>>>> for which there really aren't choices or options. The mapping is
>>>>>>> straightforward, like long to int64.
>>>>>>>
>>>>>>> For the individual column encodings, Parquet chooses those
>>>>>>> automatically based on the column type and data. For example,
>>>>>>> dictionary encoding is used if it gets better results than plain
>>>>>>> encoding, and integer columns always use the bit-packing and
>>>>>>> run-length-encoding hybrid. There aren't many choices you would make
>>>>>>> on a per-column basis here, either.
>>>>>>>
>>>>>>> There are two options you can control that affect encodings: the
>>>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>>>> encodings are delta binary and delta integer, which can't be read by
>>>>>>> some older readers or by Apache Impala. They get great compression on
>>>>>>> certain types, though. You can also control the maximum dictionary
>>>>>>> size, which could help if you have columns that should be
>>>>>>> dictionary-encoded but are falling back to plain encoding because the
>>>>>>> dictionary gets too big.
>>>>>>>
>>>>>>> Both of those options are exposed by the builder when you create a
>>>>>>> writer:
>>>>>>>
>>>>>>>   AvroParquetWriter.builder(outputPath)
>>>>>>>       .withSchema(schema)
>>>>>>>       .withDataModel(ReflectData.get())
>>>>>>>       .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>>>       .withDictionaryPageSize(2*1024*1024)
>>>>>>>       .build();
>>>>>>>
>>>>>>> The default dictionary page size is 1MB.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>>>>
>>>>>>>> I want to control/override the encoding type for a column and I find
>>>>>>>> no documentation or examples regarding that.
>>>>>>>>
>>>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>>>>>> wonder how I can either set or hint columns to use a particular
>>>>>>>> encoding option. Is that possible?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> -Stefán
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Cloudera, Inc.
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Cloudera, Inc.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
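PS: for when the parquet-avro logical types pull request lands, my reading of "pass that model into the builder" is roughly the sketch below. TimeConversions.TimestampConversion is the Avro 1.8.0 class from Ryan's snippet; outputPath and schema are the same placeholders as above, and none of this is tested against the pending PR:

    import org.apache.avro.data.TimeConversions;
    import org.apache.avro.generic.GenericData;

    // Data model that converts timestamp-millis logical types (Avro 1.8.0).
    // Assumes occurred_at carries the timestamp-millis logical type in the Avro schema.
    GenericData model = new GenericData();
    model.addLogicalTypeConversion(new TimeConversions.TimestampConversion());

    // Hand the conversion-aware model to the writer instead of ReflectData.get().
    ParquetWriter<Object> writer = AvroParquetWriter.builder(outputPath)
        .withSchema(schema)
        .withDataModel(model)
        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
        .build();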
