Re: Rekindling the Apache Parquet monorepo discussion

2018-02-26 Thread Wes McKinney
In the interest of continuity, we could rename apache/parquet-mr to
apache/parquet, move some directories around, and then merge in the
parquet-format and parquet-cpp repos. There are probably other
approaches that would be fine, too.
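
(Editor's illustration, not part of the original thread: one way such a history-preserving merge could be scripted. The git-filter-repo tool, repository paths, and branch names below are assumptions, not a decided plan.)

{noformat}
# Rough sketch of a history-preserving monorepo merge.
# Assumes git >= 2.9 (for --allow-unrelated-histories) and the external
# git-filter-repo tool; layout and names are illustrative only.
git clone https://github.com/apache/parquet-cpp parquet-cpp
(cd parquet-cpp && git filter-repo --to-subdirectory-filter cpp)

git clone https://github.com/apache/parquet-mr parquet
cd parquet
git remote add cpp ../parquet-cpp
git fetch cpp
git merge --allow-unrelated-histories cpp/master -m "Merge parquet-cpp history"
# "git blame cpp/<file>" now traces the original parquet-cpp commits.
# Repeat the same steps for parquet-format.
{noformat}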

On Mon, Feb 26, 2018 at 9:40 PM, Nong Li  wrote:
> Are you thinking of merging all 3 into one or two at a time?
>
> On Mon, Feb 26, 2018 at 7:09 AM, Wes McKinney  wrote:
>
>> What I would suggest is to create a script that forms the merged
>> repository (so we can verify that "git blame" will still work as
>> intended). If there is consensus about doing this generally, we can
>> then debate the structure of the repo and other details. For example,
>> it could be similar to Apache Thrift's repo structure.
>>
>>
>> On Mon, Feb 26, 2018 at 9:32 AM, Deepak Majeti 
>> wrote:
>> > +1. The compatibility benefits will be worth it.
>> >
>> > On Mon, Feb 26, 2018 at 12:48 AM, Nong Li  wrote:
>> >
>> >> I think this is a great idea. Let me know if I can help.
>> >>
>> >> On Sat, Feb 24, 2018 at 4:07 PM, Wes McKinney 
>> wrote:
>> >>
>> >> > hi folks,
>> >> >
>> >> > in a past sync we discussed the prospect of combining all of the
>> >> > Parquet subprojects into a single code repo. Since there are some
>> >> > other programming languages which may join the fold (Rust, C#.NET), it
>> >> > would be beneficial to combine everything into a single repository to
>> >> > assist with integration and compatibility testing.
>> >> >
>> >> > Subprojects (C++, Java, etc.) could still have their own versioned
>> >> > releases.
>> >> >
>> >> > Is this still of interest to the community? I would be willing to
>> >> > assist with this effort. In theory the repo merge could be performed
>> >> > without loss of git history in the respective projects.
>> >> >
>> >> > - Wes
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>


Re: Contributing parquet-rs to Apache?

2018-02-26 Thread Chao Sun
Thanks Ryan. The code is currently owned by me and is being worked on by
Ivan and myself. Yes, I believe both of us will continue working on it
after moving to Apache.

Best,
Chao

On Mon, Feb 26, 2018 at 10:36 AM, Ryan Blue 
wrote:

> Chao,
>
> From looking at the doc that Wes sent, I think the first thing to do is to
> find out who owns the copyright for the code that would be imported. Then we
> would plan how to get grants or license agreements from copyright holders,
> and finally start a resolution in the incubator to accept the code.
>
> So I guess the question is: who owns the code and will those people or
> organizations work on it once it is moved to Apache?
>
> rb
>
> On Fri, Feb 23, 2018 at 11:06 PM, Chao Sun  wrote:
>
> > Thanks Wes. We are ready to start the IP clearance - can a PMC member
> > help guide us through the process?
> >
> > Best,
> > Chao
> >
> > On Wed, Feb 21, 2018 at 12:43 PM, Wes McKinney 
> > wrote:
> >
> > > hi Ivan and Chao,
> > >
> > > Since this work has been ongoing for more than a year, it would be best to
> > > conduct an IP clearance to import it into the Apache Parquet project
> > > (http://incubator.apache.org/ip-clearance/).
> > >
> > > One or more members of the PMC will need to assist with this to
> > > prepare the documentation for the IP review and to initiate the
> > > process.
> > >
> > > It is OK for the library to not be feature complete.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Fri, Feb 16, 2018 at 3:39 AM, Ivan Sadikov 
> > > wrote:
> > > > Hi Chao,
> > > >
> > > > Great to hear that you are pushing this forward.
> > > > Apologies for not forwarding the email thread earlier.
> > > >
> > > > I will try to fix some issues in milestone 1, so that we can have
> > > > the read part complete.
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Ivan
> > > > On Fri, 16 Feb 2018 at 5:33 PM, Chao Sun  wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Just joined this mailing list. Ivan and I have been working on a
> > > >> Rust implementation of Parquet for some time. It still lacks many
> > > >> features, but the eventual goal is to contribute it to the Apache
> > > >> community.
> > > >>
> > > >> I saw a few weeks ago that there was a discussion between Ivan and
> > > >> Wes about this topic, and wonder what the required steps are to
> > > >> realize this. Is it OK if it is not feature complete yet?
> > > >>
> > > >> Thanks,
> > > >> Chao
> > > >>
> > >
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


[jira] [Commented] (PARQUET-796) Delta Encoding is not used when dictionary enabled

2018-02-26 Thread Hans Brende (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16377464#comment-16377464
 ] 

Hans Brende commented on PARQUET-796:
-

[~rdblue] I agree with everything you said--sounds like a good strategy! 

But until this strategy is implemented, I am still forced to implement hackish
solutions to manually turn off dictionary encoding for specific columns.

Any idea on a timeline for when your solution will be implemented?
I'd be glad to help if there's anything I can do.
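
(Editor's note, not part of the original comment: one such per-column workaround is sketched below, under loudly-labeled assumptions.)

{code:java}
// Editor's sketch of the "hackish" per-column workaround, NOT the fix
// discussed above. The per-column withDictionaryEncoding(path, enable)
// overload is an assumption borrowed from later parquet-mr releases
// (it does not exist in 1.9.x), and the column path is hypothetical.
// Class: org.apache.parquet.column.ParquetProperties
ParquetProperties props = ParquetProperties.builder()
    .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
    .withDictionaryEncoding(true)               // keep dictionaries by default
    .withDictionaryEncoding("event_ts", false)  // let this column fall through
    .build();                                   // to the version-2 delta writer
{code}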

> Delta Encoding is not used when dictionary enabled
> --
>
> Key: PARQUET-796
> URL: https://issues.apache.org/jira/browse/PARQUET-796
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jakub Liska
>Priority: Critical
> Fix For: 1.9.1
>
>
> The current code doesn't allow using both Delta Encoding and Dictionary
> Encoding. If I instantiate ParquetWriter like this:
> {code}
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, 
> blockSize, pageSize, dictPageSize, enableDictionary = true, true, 
> ParquetProperties.WriterVersion.PARQUET_2_0, configuration)
> {code}
> Then this piece of code:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> causes DictionaryValuesWriter to be used instead of the inferred
> DeltaLongEncodingWriter.
> The original issue is here : 
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768





[jira] [Comment Edited] (PARQUET-796) Delta Encoding is not used when dictionary enabled

2018-02-26 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16377382#comment-16377382
 ] 

Ryan Blue edited comment on PARQUET-796 at 2/26/18 7:06 PM:


I don't recommend using the delta long encoding because I think we need to 
update to better encodings (specifically, the zig-zag-encoding ones in [this 
branch|https://github.com/rdblue/parquet-mr/commits/encoders]).

We could definitely use a better fallback, but I don't think the solution is to 
turn off dictionary encoding. If you can use dictionary encoding to get a 
smaller size, you should. The problem is when dictionary encoding needs to test 
whether another encoding would be better. It currently tests against plain and 
uses plain. We should have it test against a delta encoding and use one.

This kind of improvement is why we added PARQUET-601. We want to be able to 
test out different ways of choosing an encoding at write time. But we do not 
want to make it so that users must specify their own encodings because we want 
Parquet to select them automatically and get the choice right. PARQUET-601 is 
about testing out strategies that we release as the defaults.
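
(Editor's illustration of the fallback idea above: a self-contained sketch, not parquet-mr's actual fallback machinery; all names below are invented stand-ins.)

{code:java}
// Editor's sketch, not the real parquet-mr API: pick whichever candidate
// encoding currently buffers the fewest bytes, instead of always falling
// back to plain. SizedWriter is an invented stand-in for
// org.apache.parquet.column.values.ValuesWriter.
interface SizedWriter {
  long bufferedSize();   // bytes this encoding has produced so far
  String name();
}

final class FallbackChooser {
  // Dictionary wins ties, since dictionary-encoded pages also compress well.
  static SizedWriter choose(SizedWriter dictionary, SizedWriter... candidates) {
    SizedWriter best = dictionary;
    for (SizedWriter c : candidates) {
      if (c.bufferedSize() < best.bufferedSize()) {
        best = c;
      }
    }
    return best;
  }
}
{code}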


was (Author: rdblue):
I don't recommend using the delta long encoding because I think we need to 
update to better encodings (specifically, the zig-zag-encoding ones in this 
branch).

We could definitely use a better fallback, but I don't think the solution is to 
turn off dictionary encoding. If you can use dictionary encoding to get a 
smaller size, you should. The problem is when dictionary encoding needs to test 
whether another encoding would be better. It currently tests against plain and 
uses plain. We should have it test against a delta encoding and use one.

This kind of improvement is why we added PARQUET-601. We want to be able to 
test out different ways of choosing an encoding at write time. But we do not 
want to make it so that users must specify their own encodings because we want 
Parquet to select them automatically and get the choice right. PARQUET-601 is 
about testing out strategies that we release as the defaults.

> Delta Encoding is not used when dictionary enabled
> --
>
> Key: PARQUET-796
> URL: https://issues.apache.org/jira/browse/PARQUET-796
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jakub Liska
>Priority: Critical
> Fix For: 1.9.1
>
>
> The current code doesn't allow using both Delta Encoding and Dictionary
> Encoding. If I instantiate ParquetWriter like this:
> {code}
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, 
> blockSize, pageSize, dictPageSize, enableDictionary = true, true, 
> ParquetProperties.WriterVersion.PARQUET_2_0, configuration)
> {code}
> Then this piece of code:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> causes DictionaryValuesWriter to be used instead of the inferred
> DeltaLongEncodingWriter.
> The original issue is here : 
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768





Re: Contributing parquet-rs to Apache?

2018-02-26 Thread Ryan Blue
Chao,

From looking at the doc that Wes sent, I think the first thing to do is to
find out who owns the copyright for the code that would be imported. Then we
would plan how to get grants or license agreements from copyright holders,
and finally start a resolution in the incubator to accept the code.

So I guess the question is: who owns the code and will those people or
organizations work on it once it is moved to Apache?

rb

On Fri, Feb 23, 2018 at 11:06 PM, Chao Sun  wrote:

> Thanks Wes. We are ready to start the IP clearance - can a PMC member
> help guide us through the process?
>
> Best,
> Chao
>
> On Wed, Feb 21, 2018 at 12:43 PM, Wes McKinney 
> wrote:
>
> > hi Ivan and Chao,
> >
> > Since this work has been ongoing for more than a year, it would be best to
> > conduct an IP clearance to import it into the Apache Parquet project
> > (http://incubator.apache.org/ip-clearance/).
> >
> > One or more members of the PMC will need to assist with this to
> > prepare the documentation for the IP review and to initiate the
> > process.
> >
> > It is OK for the library to not be feature complete.
> >
> > Thanks
> > Wes
> >
> > On Fri, Feb 16, 2018 at 3:39 AM, Ivan Sadikov 
> > wrote:
> > > Hi Chao,
> > >
> > > Great to hear that you are pushing this forward.
> > > Apologies for not forwarding the email thread earlier.
> > >
> > > I will try to fix some issues in milestone 1, so that we can have
> > > the read part complete.
> > >
> > >
> > > Cheers,
> > >
> > > Ivan
> > > On Fri, 16 Feb 2018 at 5:33 PM, Chao Sun  wrote:
> > >
> > >> Hi,
> > >>
> > >> Just joined this mailing list. Ivan and I have been working on a Rust
> > >> implementation of Parquet for some time. It still lacks many features,
> > >> but the eventual goal is to contribute it to the Apache community.
> > >>
> > >> I saw a few weeks ago that there was a discussion between Ivan and Wes
> > >> about this topic, and wonder what the required steps are to realize
> > >> this. Is it OK if it is not feature complete yet?
> > >>
> > >> Thanks,
> > >> Chao
> > >>
> >
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: Rekindling the Apache Parquet monorepo discussion

2018-02-26 Thread Wes McKinney
What I would suggest is to create a script that forms the merged
repository (so we can verify that "git blame" will still work as
intended). If there is consensus about doing this generally, we can
then debate the structure of the repo and other details. For example,
it could be similar to Apache Thrift's repo structure.


On Mon, Feb 26, 2018 at 9:32 AM, Deepak Majeti  wrote:
> +1. The compatibility benefits will be worth it.
>
> On Mon, Feb 26, 2018 at 12:48 AM, Nong Li  wrote:
>
>> I think this is a great idea. Let me know if I can help.
>>
>> On Sat, Feb 24, 2018 at 4:07 PM, Wes McKinney  wrote:
>>
>> > hi folks,
>> >
>> > in a past sync we discussed the prospect of combining all of the
>> > Parquet subprojects into a single code repo. Since there are some
>> > other programming languages which may join the fold (Rust, C#.NET), it
>> > would be beneficial to combine everything into a single repository to
>> > assist with integration and compatibility testing.
>> >
>> > Subprojects (C++, Java, etc.) could still have their own versioned
>> > releases.
>> >
>> > Is this still of interest to the community? I would be willing to
>> > assist with this effort. In theory the repo merge could be performed
>> > without loss of git history in the respective projects.
>> >
>> > - Wes
>> >
>>
>
>
>
> --
> regards,
> Deepak Majeti


Re: Rekindling the Apache Parquet monorepo discussion

2018-02-26 Thread Deepak Majeti
+1. The compatibility benefits will be worth it.

On Mon, Feb 26, 2018 at 12:48 AM, Nong Li  wrote:

> I think this is a great idea. Let me know if I can help.
>
> On Sat, Feb 24, 2018 at 4:07 PM, Wes McKinney  wrote:
>
> > hi folks,
> >
> > in a past sync we discussed the prospect of combining all of the
> > Parquet subprojects into a single code repo. Since there are some
> > other programming languages which may join the fold (Rust, C#.NET), it
> > would be beneficial to combine everything into a single repository to
> > assist with integration and compatibility testing.
> >
> > Subprojects (C++, Java, etc.) could still have their own versioned
> > releases.
> >
> > Is this still of interest to the community? I would be willing to
> > assist with this effort. In theory the repo merge could be performed
> > without loss of git history in the respective projects.
> >
> > - Wes
> >
>



-- 
regards,
Deepak Majeti


[jira] [Updated] (PARQUET-1222) Definition of float and double sort order is ambiguous

2018-02-26 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1222:
---
Description: 
Currently parquet-format specifies the sort order for floating point numbers as 
follows:
{code:java}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial
ordering with strange behaviour in specific corner cases. For example,
according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN
to anything always returns false. This ordering is not suitable for statistics.
Additionally, the Java implementation already uses a different (total) ordering
that handles these cases correctly but differently than the C++
implementations, which leads to interoperability problems.

TypeDefinedOrder for doubles and floats should be deprecated and a new
TotalFloatingPointOrder should be introduced. The default for writing doubles
and floats would be the new TotalFloatingPointOrder. This ordering should be
efficient and easy to implement in all programming languages.

For reading existing stats created using TypeDefinedOrder, the following 
compatibility rules should be applied:
* When looking for NaN values, min and max should be ignored.
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
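
(Editor's illustration: Java's built-in Double.compare is one example of a cheap, easy-to-implement total ordering, close to, though not identical with, the proposal quoted below; a minimal sketch:)

{code:java}
// Editor's sketch: Double.compare gives a total order over doubles.
// Caveat: it sorts -0.0 strictly below +0.0, whereas the proposal puts
// the two zeros in the same equivalence class.
import java.util.Arrays;

public class TotalOrderDemo {
  public static void main(String[] args) {
    Double[] xs = {0.0, -0.0, Double.NaN, -1.5,
                   Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY};
    Arrays.sort(xs, Double::compare);
    // Prints: [-Infinity, -1.5, -0.0, 0.0, Infinity, NaN]
    System.out.println(Arrays.toString(xs));
  }
}
{code}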

  was:
Currently parquet-format specifies the sort order for floating point numbers as 
follows:
{code:java}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial
ordering with strange behaviour in specific corner cases. For example,
according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN
to anything always returns false. This ordering is not suitable for statistics.
Additionally, the Java implementation already uses a different (total) ordering
that handles these cases correctly but differently than the C++
implementations, which leads to interoperability problems.

TypeDefinedOrder for doubles and floats should be deprecated and a new 
TotalFloatingPointOrder should be introduced. The default for writing doubles 
and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
the following:
 * -∞
 * negative numbers in their natural order
 * -0 and +0 in the same equivalence class (!)
 * positive numbers in their natural order
 * +∞
 * all NaN values, including the negative ones (!), in the same equivalence
class (!)

This ordering should be efficient and easy to implement in all programming
languages. A visual representation of the ordering of some example values:

!ordering.png|width=640px!

For reading existing stats created using TypeDefinedOrder, the following 
compatibility rules should be applied:
* When looking for NaN values, min and max should be ignored.
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.


> Definition of float and double sort order is ambiguous
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
> Fix For: format-2.5.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> efficient and easy to implement in all programming languages.
> For reading existing stats 

[jira] [Updated] (PARQUET-1222) Definition of float and double sort order is ambiguous

2018-02-26 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1222:
---
Attachment: (was: ordering.png)

> Definition of float and double sort order is ambiguous
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
> Fix For: format-2.5.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> efficient and easy to implement in all programming languages.
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is +0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain +0 values as well.





[jira] [Commented] (PARQUET-1222) Definition of float and double sort order is ambiguous

2018-02-26 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376924#comment-16376924
 ] 

Zoltan Ivanfi commented on PARQUET-1222:


[~jbapple] The proposed order was intentionally different from IEEE 754 
totalOrder, because of the following rules of totalOrder:

{noformat}
3) If x and y represent the same floating-point datum:
   i) If x and y have negative sign,
      totalOrder(x, y) is true if and only if the exponent of x ≥ the exponent of y
   ii) otherwise
      totalOrder(x, y) is true if and only if the exponent of x ≤ the exponent of y.
{noformat}

This led me to believe that different bit patterns may correspond to the same
numeric value. Considering such patterns different would be problematic, as it
could lead to row groups being dropped or not depending on which bit pattern
was used to save the data and which bit pattern is used to look it up. However,
thinking more about it, it seems to me that apart from +0/-0 and the various
NaNs, all other bit patterns correspond to different numbers. Since the
opposite assumption was the reason for the proposed order, I am
removing it for now.
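
(Editor's illustration of the bit-pattern observation above, as a minimal Java sketch:)

{code:java}
// Editor's sketch of the claim above: apart from the two zeros and the
// many NaN payloads, distinct double bit patterns denote distinct values.
long zp = Double.doubleToRawLongBits(0.0);            // 0x0000000000000000
long zn = Double.doubleToRawLongBits(-0.0);           // 0x8000000000000000
System.out.println(zp != zn && 0.0 == -0.0);          // true: two patterns, one value

long nanA = Double.doubleToRawLongBits(Double.NaN);   // canonical NaN payload
long nanB = Double.doubleToRawLongBits(Double.longBitsToDouble(0x7ff8000000000001L));
System.out.println(nanA != nanB);                     // true: many NaN bit patterns
{code}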

> Definition of float and double sort order is ambiguous
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
> Fix For: format-2.5.0
>
> Attachments: ordering.png
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
> the following:
>  * -∞
>  * negative numbers in their natural order
>  * -0 and +0 in the same equivalence class (!)
>  * positive numbers in their natural order
>  * +∞
>  * all NaN values, including the negative ones (!), in the same equivalence 
> class (!)
> This ordering should be efficient and easy to implement in all programming 
> languages. A visual representation of the ordering of some example values:
> !ordering.png|width=640px!
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is +0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain +0 values as well.





[jira] [Issue Comment Deleted] (PARQUET-1222) Definition of float and double sort order is ambiguous

2018-02-26 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1222:
---
Comment: was deleted

(was: [~jbapple] Indeed, the proposed order does not match the IEEE 754 
totalOrder for the following reasons:
* In case of subnormal values, different bit patterns may correspond to the 
same numeric value, similar to how one can write 0.005 as 0.5 * 10^-2 or 0.05 * 
10^-1 or ... The IEEE 754 totalOrder predicate orders these numerically equal 
values according to the exponents used in their representation. This could lead 
to row groups being dropped or not based on what bit pattern was used to save 
the data and what bit pattern is used for looking them up. Since both the 
corresponding numeric values and their presentation to the user are identical, 
this would lead to behaviour that I consider incorrect, or at least unintuitive.
* No programming languages or libraries I know of implement the totalOrder 
predicate. I am also not aware of any hardware implementation, and if there 
were one, it would still be virtually impossible to use it without proper 
exposure through a library or some very low-level wizardry.
* The definition of the totalOrder predicate looks very complicated. I think 
any implementation is likely to be complicated, error-prone and inefficient.

The proposed order, on the other hand, is sufficient for sorting and seems to
be easy to implement. If the regular (normally hardware-accelerated) floating
point comparison operators can decide the order, the comparison returns that
result; otherwise it orders by the exponent part of the bit pattern.

Depending on one's point of view, the proposed ordering may not really be
considered a total ordering, as both zeros are equivalent to each other and
all NaNs are equivalent to each other as well. It is certainly a (strict) weak
ordering, though, and maybe the enum should be renamed as such. My
understanding is that the only difference between a weak ordering and a total
ordering is whether equivalence by ordering is the same as equality by value.
But then the totalOrder defined by IEEE 754 is not a total order either, since
it defines NaNs of the same bit pattern to be equivalent for ordering, while
at the same time it defines a NaN value to not be equal to itself. Since the
equality defined by IEEE 754 is already inconsistent in the mathematical
sense (because of x != x), there really is no proper equality operation to use
for judging whether the proposed ordering is weak or total. From a more
practical point of view, we could consider an ordering weak if it is sufficient
for sorting but not for hashing, while a total ordering would be sufficient for
both. This, however, depends on how we calculate the hash values and is also
entirely out of scope for min/max values or indexes.)

> Definition of float and double sort order is ambiguous
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
> Fix For: format-2.5.0
>
> Attachments: ordering.png
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
> the following:
>  * -∞
>  * negative numbers in their natural order
>  * -0 and +0 in the same equivalence class (!)
>  * positive numbers in their natural order
>  * +∞
>  * all NaN values, including the negative ones (!), in the same equivalence 
> class (!)
> This ordering should be efficient and easy to implement in all programming 
> languages. A visual representation of the ordering of some example values:
> !ordering.png|width=640px!
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When 

[jira] [Created] (PARQUET-1237) Reading big texts causes OutOfMemory Error. How to read text partially?

2018-02-26 Thread Andrei Iatsuk (JIRA)
Andrei Iatsuk created PARQUET-1237:
--

 Summary: Reading big texts causes OutOfMemory Error. How to read 
text partially?
 Key: PARQUET-1237
 URL: https://issues.apache.org/jira/browse/PARQUET-1237
 Project: Parquet
  Issue Type: Bug
  Components: parquet-avro
Affects Versions: 1.8.1
 Environment: I have a dataset with big strings (every record is about 
15 MB) in Parquet.

When I try to open all the parquet parts I get an OutOfMemory exception.

How can I get only headers (the first 100 characters) for each string record 
without reading the whole record?

 
{code:java}
  Schema avroProj = SchemaBuilder.builder()
    .record("proj").fields()
    .name("idx").type().nullable().longType().noDefault()
    .name("text").type().nullable().bytesType().noDefault()
    .endRecord();

  Configuration conf = new Configuration();

  AvroReadSupport.setRequestedProjection(conf, avroProj);
  ParquetReader<GenericRecord> parquetReader = AvroParquetReader
    .<GenericRecord>builder(new Path(filePath))
    .withConf(conf)
    .build();

  GenericRecord record = parquetReader.read();
  // the record already has the full text in RAM at this point
  Long idx = (Long) record.get("idx");
  ByteBuffer rawText = (ByteBuffer) record.get("text");
  String header = new String(rawText.array()).substring(0, 200);
{code}
Reporter: Andrei Iatsuk


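(Editor's note: the thread contains no answer; the sketch below shows one hedged mitigation, projecting away the large column so it is never materialized. The idxOnly schema name is illustrative, and conf is the Configuration from the snippet above.)

{code:java}
// Editor's sketch (not a reply from the thread): Parquet reads whole
// values, so a record's 15 MB text cannot be read partially. But a
// requested projection that omits the "text" column lets the reader
// skip that column's pages entirely, avoiding the OutOfMemoryError
// whenever only "idx" is needed.
Schema idxOnly = SchemaBuilder.builder()
    .record("proj").fields()
    .name("idx").type().nullable().longType().noDefault()
    .endRecord();
AvroReadSupport.setRequestedProjection(conf, idxOnly);
{code}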





[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376502#comment-16376502
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

zivanfi commented on issue #444: PARQUET-1225: NaN values may lead to incorrect 
filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#issuecomment-368420140
 
 
   I agree, thanks for your efforts!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.4.0
>
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while gathering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.
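
(Editor's illustration: the described pitfall reproduces in any language with IEEE comparison semantics; a minimal Java sketch, not parquet-cpp code:)

{code:java}
// Editor's sketch of the described pitfall (Java's < has the same IEEE
// semantics as the C++ operator): once min is NaN, "v < min" is false
// for every v, so NaN sticks as the minimum.
double[] values = {Double.NaN, 3.0, -1.0};
double min = values[0];
for (int i = 1; i < values.length; i++) {
  if (values[i] < min) {      // always false while min is NaN
    min = values[i];
  }
}
System.out.println(min);      // NaN, although -1.0 is present
{code}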


