[jira] [Commented] (PARQUET-1199) [C++] Support writing (and test reading) boolean values with RLE encoding

2018-01-19 Thread Jim Pivarski (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333004#comment-16333004
 ] 

Jim Pivarski commented on PARQUET-1199:
---

Sorry, I misread the issue. The test files I posted use plain encoding
(encoding == 0, PLAIN). I don't know how to force a writer to generate RLE
encoding (3).
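
To make the difference between the two encodings concrete, here is a minimal sketch of how a run of boolean values could be laid out under PLAIN versus the RLE/bit-packing hybrid, as I understand the Parquet format description. The function names are hypothetical and are not part of parquet-cpp or pyarrow; only pure RLE runs are produced, and any page-level framing around the runs is left out.

```python
from typing import List, Sequence

# Encoding ids as quoted in the comment above.
PLAIN = 0
RLE = 3


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def plain_encode_bools(values: Sequence[bool]) -> bytes:
    """PLAIN booleans: one bit per value, least significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def rle_encode_bools(values: Sequence[bool]) -> bytes:
    """Hybrid encoding at bit width 1, restricted to RLE runs (assumption)."""
    out = bytearray()
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out += encode_varint((j - i) << 1)  # low bit 0 marks an RLE run
        out.append(1 if values[i] else 0)   # repeated value, padded to 1 byte
        i = j
    return bytes(out)


def rle_decode_bools(data: bytes) -> List[bool]:
    """Decode the output of rle_encode_bools; bit-packed runs are rejected."""
    values: List[bool] = []
    pos = 0
    while pos < len(data):
        header, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            header |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        if header & 1:
            raise NotImplementedError("bit-packed runs are not handled in this sketch")
        values.extend([bool(data[pos])] * (header >> 1))
        pos += 1
    return values
```

For short, frequently alternating columns the PLAIN form is smaller; the RLE form only pays off when the same value repeats over long stretches.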

> [C++] Support writing (and test reading) boolean values with RLE encoding
> -
>
> Key: PARQUET-1199
> URL: https://issues.apache.org/jira/browse/PARQUET-1199
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> This is supported by the Parquet specification; we should ensure that we are 
> able to read such data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1199) [C++] Support writing (and test reading) boolean values with RLE encoding

2018-01-19 Thread Jim Pivarski (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332992#comment-16332992
 ] 

Jim Pivarski commented on PARQUET-1199:
---

I happen to have appropriate test files:

https://github.com/diana-hep/oamap/blob/master/tests/samples/record-primitives.parquet
https://github.com/diana-hep/oamap/blob/master/tests/samples/nullable-record-primitives.parquet

The following is correct output from pyarrow 0.8.0.



>>> import pyarrow.parquet
>>> f1 = pyarrow.parquet.read_table("record-primitives.parquet", columns=["u1", "u4"])
>>> f2 = pyarrow.parquet.read_table("nullable-record-primitives.parquet", columns=["u1", "u4"])
>>> f1[0]
chunk 0:
[
  False,
  True,
  True,
  False,
  False
]
>>> f1[1]
chunk 0:
[
  1,
  2,
  3,
  4,
  5
]
>>> f2[0]
chunk 0:
[
  NA,
  True,
  NA,
  False,
  NA
]
>>> f2[1]
chunk 0:
[
  1,
  NA,
  NA,
  NA,
  5
]






Re: parquet-mr build fail with jdk7; move to jdk8?

2018-01-19 Thread Zoltan Ivanfi
+1 for moving to Java 8

Zoltan

On Fri, Jan 19, 2018 at 5:48 PM Ryan Blue wrote:

> We should probably make sure we have agreement from the community on this
> before we move forward; either through replies to this thread or in
> discussion at the next sync. Thanks, Gabor!
>
> On Thu, Jan 18, 2018 at 11:26 PM, Gabor Szadovszky <
> gabor.szadovs...@cloudera.com> wrote:
>
> > Created PARQUET-1198 to track moving to java8.
> >
> > Cheers,
> > Gabor
> >
> > > On 18 Jan 2018, at 22:16, Ryan Blue wrote:
> > >
> > > I think we should move. JDK7 has been EOL for a couple years now and
> > > it's a pain to even install an old JDK7 these days.
> > >
> > > rb
> > >
> > > On Thu, Jan 18, 2018 at 2:31 AM, Gabor Szadovszky <
> > > gabor.szadovs...@cloudera.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> The last commit in parquet-mr master
> > >> (c6764c4a0848abf1d581e22df8b33e28ee9f2ced) does not build with jdk7,
> > >> only with jdk8. We did not catch the issue because both Travis and I
> > >> use jdk8 to build parquet-mr. (The source level in the pom.xml is set
> > >> to 1.7, so both jdk7 and jdk8 should be able to build it, but jdk7
> > >> fails with “invalid inferred types for T; inferred type does not
> > >> conform to declared bound(s)”.)
> > >>
> > >> I think we have 2 options:
> > >> 1. Fix the code so it compiles with both jdk7 and jdk8. In this case
> > >> Travis should be updated to check both; otherwise we might have similar
> > >> issues in the future.
> > >> 2. Bump up source/target levels to java8.
> > >>
> > >> What do you think?
> > >>
> > >> Thanks,
> > >> Gabor
> > >
> > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


[jira] [Created] (PARQUET-1199) [C++] Support writing (and test reading) boolean values with RLE encoding

2018-01-19 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-1199:
-

 Summary: [C++] Support writing (and test reading) boolean values 
with RLE encoding
 Key: PARQUET-1199
 URL: https://issues.apache.org/jira/browse/PARQUET-1199
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.5.0


This is supported by the Parquet specification; we should ensure that we are 
able to read such data.





Re: parquet-mr build fail with jdk7; move to jdk8?

2018-01-19 Thread Ryan Blue
We should probably make sure we have agreement from the community on this
before we move forward; either through replies to this thread or in
discussion at the next sync. Thanks, Gabor!

On Thu, Jan 18, 2018 at 11:26 PM, Gabor Szadovszky <
gabor.szadovs...@cloudera.com> wrote:

> Created PARQUET-1198 to track moving to java8.
>
> Cheers,
> Gabor
>
> > On 18 Jan 2018, at 22:16, Ryan Blue wrote:
> >
> > I think we should move. JDK7 has been EOL for a couple years now and
> > it's a pain to even install an old JDK7 these days.
> >
> > rb
> >
> > On Thu, Jan 18, 2018 at 2:31 AM, Gabor Szadovszky <
> > gabor.szadovs...@cloudera.com> wrote:
> >
> >> Hi,
> >>
> >> The last commit in parquet-mr master
> >> (c6764c4a0848abf1d581e22df8b33e28ee9f2ced) does not build with jdk7,
> >> only with jdk8. We did not catch the issue because both Travis and I
> >> use jdk8 to build parquet-mr. (The source level in the pom.xml is set
> >> to 1.7, so both jdk7 and jdk8 should be able to build it, but jdk7
> >> fails with “invalid inferred types for T; inferred type does not
> >> conform to declared bound(s)”.)
> >>
> >> I think we have 2 options:
> >> 1. Fix the code so it compiles with both jdk7 and jdk8. In this case
> >> Travis should be updated to check both; otherwise we might have similar
> >> issues in the future.
> >> 2. Bump up source/target levels to java8.
> >>
> >> What do you think?
> >>
> >> Thanks,
> >> Gabor
> >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>


-- 
Ryan Blue
Software Engineer
Netflix
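
For reference, option 2 from the quoted message (bumping the source/target levels) would usually come down to a compiler-plugin setting along the following lines. This is a generic Maven sketch, not a copy of parquet-mr's pom.xml, which may set the levels through properties or a parent POM instead.

```xml
<!-- Hypothetical pom.xml fragment: raise the Java source/target level to 8.
     The surrounding structure of the real parquet-mr pom.xml is assumed. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```

With the levels at 1.8, a JDK7 build fails immediately on the unsupported source level instead of on a type-inference error deep in the code.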


[jira] [Commented] (PARQUET-922) Add index pages to the format to support efficient page skipping

2018-01-19 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332194#comment-16332194
 ] 

Zoltan Ivanfi commented on PARQUET-922:
---

I was looking for a JIRA for the actual implementation in parquet-mr, but 
couldn't find it. Does such a JIRA already exist?

> Add index pages to the format to support efficient page skipping
> 
>
> Key: PARQUET-922
> URL: https://issues.apache.org/jira/browse/PARQUET-922
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Marcel Kornacker
>Priority: Major
> Fix For: format-2.4.0
>
>
> When a Parquet file is sorted we can define an index consisting of the 
> boundary values for the pages of the columns sorted on as well as the offsets 
> and length of said pages in the file.
> The goal is to optimize lookup and range scan type queries, using this to 
> read only the pages containing data matching the filter.
> We'd require the pages to be aligned across columns.
> [~marcelk] will add a link to the google doc to discuss the spec
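
As an illustration of the idea in the description above, the following sketch keeps one entry per page (boundary values plus offset and length) for a sorted column and uses binary search to pick only the pages that can contain values in a requested range. All names (`PageEntry`, `pages_for_range`) are hypothetical; the actual index layout is whatever the spec discussion settles on.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple


class PageEntry(NamedTuple):
    min_value: int  # smallest value stored in the page
    max_value: int  # largest value stored in the page
    offset: int     # byte offset of the page in the file
    length: int     # byte length of the page


def pages_for_range(index: List[PageEntry], lo: int, hi: int) -> List[PageEntry]:
    """Return the pages whose [min, max] interval overlaps [lo, hi].

    Assumes the column is sorted, so both the min and the max values are
    non-decreasing from one page to the next.
    """
    maxes = [p.max_value for p in index]
    mins = [p.min_value for p in index]
    first = bisect_left(maxes, lo)   # first page whose max is >= lo
    last = bisect_right(mins, hi)    # one past the last page whose min is <= hi
    return index[first:last]


# Four pages of 100 bytes each, covering 0-9, 10-19, 20-29 and 30-39.
example_index = [
    PageEntry(0, 9, 0, 100),
    PageEntry(10, 19, 100, 100),
    PageEntry(20, 29, 200, 100),
    PageEntry(30, 39, 300, 100),
]
selected = pages_for_range(example_index, 12, 25)
```

A reader would then seek to each selected `offset` and read `length` bytes, skipping the other pages entirely.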





[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2018-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331898#comment-16331898
 ] 

ASF GitHub Bot commented on PARQUET-41:
---

cjjnjust opened a new pull request #432: PARQUET-41: Add bloom filter for 
parquet
URL: https://github.com/apache/parquet-cpp/pull/432

This is the first part of the bloom filter patch set, which includes a bloom
filter utility and some unit tests.
Note that this patch also includes the original murmur3Hash code from Austin
Appleby. That code is not formatted according to the parquet-cpp style.




> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Ferdinand Xu
>Priority: Major
>  Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215
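
To show why a bloom filter helps with skipping whole row groups, here is a small self-contained sketch. It is not the implementation from either pull request: the class name, the sizes, and the use of SHA-256 from the standard library (instead of the MurmurHash3 code mentioned in the patch) are all assumptions made for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: bytes) -> Iterator[int]:
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # keep the stride odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def can_skip_row_group(row_group_filter: BloomFilter, wanted: bytes) -> bool:
    """A row group can be skipped only if the filter rules the value out."""
    return not row_group_filter.might_contain(wanted)
```

A writer would build one filter per column chunk while writing a row group; a reader evaluating an equality predicate would call something like `can_skip_row_group` before touching any data page.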


