I'm not looking to turn it off, absolutely not, I'm looking to use it in the most effective way :)
Is there something I can do right now to force these fields to be timestamp
fields in Parquet?

Regards,
-Stefán

On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <[email protected]> wrote:

> Got it. You can also turn off dictionary encoding with an option on the
> builder.
>
> For timestamp, the support was just released in Avro 1.8.0 and there's a
> pending pull request for adding the same logical types API to parquet-avro:
> https://github.com/apache/parquet-mr/pull/318
>
> Once that's merged, you'll just have to add conversions to your data model
> like this:
>
>   GenericData model = new GenericData();
>   model.addLogicalTypeConversion(
>       new TimeConversions.TimestampConversion());
>
> Then pass that model into the builder.
>
> rb
>
>
> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>
>> Hi Ryan,
>>
>> Thank you for taking the time.
>>
>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn on the
>> optional dictionary encoding it's used for almost anything/everything. I
>> even have some timestamp fields that are turned into a dictionary. (I
>> would have guessed delta integer.)
>>
>> I have ~5M entries in my test file; the dictionary-based one ends up at
>> 550 MB and the non-dictionary-based one ends up at 790 MB (but is still
>> faster).
>>
>> So I started wondering if I could affect these decisions to compare size,
>> speed, etc. I understand the rationale behind automatic selection; it
>> just seemed somewhat naive in that Drill scenario.
>>
>> Another matter... can you point me to an example that shows me how to
>> deal with Avro having no timestamp fields and the conversion to Parquet?
>>
>> Best regards,
>> -Stefán
>>
>>
>>
>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <[email protected]> wrote:
>>
>>> Hi Stefán,
>>>
>>> The schema converter will map Avro types to their Parquet equivalents,
>>> for which there really aren't choices or options. The mapping is
>>> straightforward, like long to int64.
>>>
>>> For the individual column encodings, Parquet chooses those automatically
>>> based on the column type and data. For example, dictionary encoding is
>>> used if it gets better results than plain encoding, and integer columns
>>> always use the bit-packing and run-length encoding hybrid. There aren't
>>> many choices you would make on a per-column basis here, either.
>>>
>>> There are two options you can control that affect encodings: the
>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>> encodings are delta binary and delta integer, which can't be read by
>>> some older readers or by Apache Impala. They get great compression on
>>> certain types, though. You can also control the maximum dictionary size,
>>> which could help if you have columns that should be dictionary-encoded
>>> but are falling back to plain encoding because the dictionary gets too
>>> big.
>>>
>>> Both of those options are exposed by the builder when you create a
>>> writer:
>>>
>>>   AvroParquetWriter.builder(outputPath)
>>>       .withSchema(schema)
>>>       .withDataModel(ReflectData.get())
>>>       .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>       .withDictionaryPageSize(2*1024*1024)
>>>       .build();
>>>
>>> The default dictionary page size is 1MB.
>>>
>>> rb
>>>
>>>
>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>
>>>> I want to control/override the encoding type for a column and I find
>>>> no documentation or examples regarding that.
>>>>
>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>> wonder how I can either set or hint columns to use a particular
>>>> encoding option. Is that possible?
>>>>
>>>> Regards,
>>>> -Stefán
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
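Pulling the pieces of the thread together, here is a rough, untested sketch of how one
might get a timestamp column written once the logical-types pull request above lands in
parquet-avro. The record schema, field names, and output path are made up for
illustration; the conversion class is the Joda-based Avro 1.8.0
TimeConversions.TimestampConversion that Ryan mentions, and the builder options are the
ones from his earlier reply.

  import java.io.IOException;

  import org.apache.avro.LogicalTypes;
  import org.apache.avro.Schema;
  import org.apache.avro.SchemaBuilder;
  import org.apache.avro.data.TimeConversions;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetWriter;
  import org.apache.parquet.column.ParquetProperties;
  import org.apache.parquet.hadoop.ParquetWriter;
  import org.joda.time.DateTime;

  public class TimestampWriteSketch {
    public static void main(String[] args) throws IOException {
      // Hypothetical schema: a long field annotated with the timestamp-millis logical type.
      Schema schema = SchemaBuilder.record("Event").fields()
          .requiredString("id")
          .name("occurred_at")
              .type(LogicalTypes.timestampMillis()
                  .addToSchema(Schema.create(Schema.Type.LONG)))
              .noDefault()
          .endRecord();

      // Register the Avro 1.8.0 conversion so the data model can handle DateTime values.
      GenericData model = new GenericData();
      model.addLogicalTypeConversion(new TimeConversions.TimestampConversion());

      // Pass the model into the builder, together with the encoding options from the thread.
      ParquetWriter<GenericRecord> writer =
          AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/events.parquet"))
              .withSchema(schema)
              .withDataModel(model)
              .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
              .withDictionaryPageSize(2 * 1024 * 1024)
              .build();

      GenericRecord record = new GenericData.Record(schema);
      record.put("id", "a1");
      record.put("occurred_at", DateTime.now());
      writer.write(record);
      writer.close();
    }
  }

Whether the DateTime value actually round-trips as an INT64 timestamp column depends on
the pending pull request being merged and on the reader (Drill in this case) recognizing
the annotation, so treat this as a sketch rather than something verified.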

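Ryan's first point, that dictionary encoding can be turned off with an option on the
builder, is also easy to sketch. The method name withDictionaryEncoding below is an
assumption about which builder option he means (it is a writer-level switch, not a
per-column one), and the schema and path are placeholders.

  import java.io.IOException;

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetWriter;
  import org.apache.parquet.hadoop.ParquetWriter;

  public class NoDictionarySketch {
    // Open a writer with dictionary encoding disabled for the whole file, so columns
    // fall back to the plain (and RLE/bit-packing) encodings.
    public static ParquetWriter<GenericRecord> openWriter(Schema schema, Path path)
        throws IOException {
      return AvroParquetWriter.<GenericRecord>builder(path)
          .withSchema(schema)
          .withDictionaryEncoding(false)
          .build();
    }
  }

Writing the same data with this flag on and off would give the same kind of size and
speed comparison as the 550 MB vs. 790 MB numbers above, without trying to override
encodings per column.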