Hi,

The old statistics have several problems. Their sort order was not properly 
specified and, per the specification, it ignores the logical type, even though 
the logical type can change the ordering (e.g. UTF8 vs. DECIMAL for the 
primitive type BINARY). See 
PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686> for more 
details. If you use Parquet 1.9.0, you can set the configuration property 
"parquet.strings.signed-min-max.enabled" to read/write statistics for 
string-ish BINARY logical types (none, UTF8, ENUM, JSON). If you do, make sure 
that all of the related BINARY values use only the lower 7 bits of their bytes 
(i.e. plain ASCII), so that the signed comparison produces the same results as 
the proper unsigned comparison would.
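
A minimal sketch of turning that on from Spark (the property name is the one 
above; routing it through the Hadoop configuration of a Spark session is my 
assumption, and the app/variable names are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("signed-minmax-demo").getOrCreate()

    // Opt in to the legacy signed min/max statistics for binary columns.
    spark.sparkContext.hadoopConfiguration
      .setBoolean("parquet.strings.signed-min-max.enabled", true)

    // Why the 7-bit restriction matters: any byte >= 0x80 is negative as a
    // signed Java byte, so signed and unsigned orderings disagree. E.g. 'é'
    // encodes in UTF-8 as 0xC3 0xA9, which compare as negative values, so
    // "é" would sort *before* "a" under signed comparison.
    println("é".getBytes("UTF-8").map(_.toInt).mkString(","))  // prints -61,-87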
Because of these problems with the already written, incorrect min/max values, 
we decided to specify new statistics fields. They are already implemented but 
not yet released. See 
PARQUET-1025 <https://issues.apache.org/jira/browse/PARQUET-1025> for details.

Regards,
Gabor

> On 13 Feb 2018, at 19:04, Siva Gudavalli <gudavalli.s...@yahoo.com.INVALID> 
> wrote:
> 
> 
> Hello,
> 
> 3 MB files are too small for Parquet; try increasing the file size.
> Keep an eye on the statistics. In our case, we haven't seen statistics being 
> generated for string data types, so those queries fall back to a full scan.
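> 
> A quick way to check is to read the footer metadata with parquet-mr. A 
> hedged sketch (method names as I recall them from parquet-mr 1.8/1.9, so 
> verify against your version; the file path is a placeholder):
> 
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.Path
>     import org.apache.parquet.hadoop.ParquetFileReader
>     import scala.collection.JavaConverters._
> 
>     // Read only the footer (no data pages) and dump per-column statistics.
>     val footer = ParquetFileReader.readFooter(
>       new Configuration(), new Path("/tmp/example.parquet"))
>     for {
>       block  <- footer.getBlocks.asScala
>       column <- block.getColumns.asScala
>     } {
>       // For string columns the statistics may be missing, as noted above.
>       println(s"${column.getPath}: ${column.getStatistics}")
>     }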
> 
> 
> Regards
> Shiv
> 
> 
>> On Feb 12, 2018, at 9:24 PM, ilegend <511618...@qq.com> wrote:
>> 
>> Hi guys,
>> We're testing Parquet performance for our big data environment. Parquet is 
>> better than ORC, but we believe Parquet has even more potential. Any 
>> comments and suggestions are welcome. The test environment is as follows:
>> 1. Server: 48 cores + 256 GB memory.
>> 2. Spark 2.1.0 + HDFS 2.6.0 + parquet-mr-1.8.1 
>> + parquet-format-2.3.0-incubating.
>> 3. The size of each HDFS file is 3 MB.
>> 4. Parquet-mr is left at its default values: row group size 128 MB, data 
>> page size 1 MB (set explicitly in the sketch below).
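>> 
>> In case it helps reproduce the setup, a hedged sketch of setting those two 
>> defaults explicitly through Spark's Hadoop configuration (the property names 
>> are the standard parquet-mr keys; the `spark` session variable is assumed):
>> 
>>     // Row group ("block") size and data page size, matching the defaults.
>>     spark.sparkContext.hadoopConfiguration
>>       .setInt("parquet.block.size", 128 * 1024 * 1024)
>>     spark.sparkContext.hadoopConfiguration
>>       .setInt("parquet.page.size", 1024 * 1024)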
>> 
>> 
>> Sent from my iPhone
>> 
>> 
> 
