[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata

2020-10-05 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208158#comment-17208158
 ] 

Uwe Korn commented on PARQUET-1345:
---

Turns out this was not due to many categorical columns but to a huge number 
(more than a million) of row groups. We cannot fix this, as Thrift messages are 
capped at 2GiB, but we could probably raise a clearer error message.
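
As a quick way to check whether a file is affected, the footer size and 
row-group count can be inspected without reading any data. A minimal sketch 
with a current pyarrow (the filename is hypothetical):

{code:python}
import pyarrow.parquet as pq

# Opening the file only parses the footer (the Thrift-serialized
# file metadata); no column data is read.
md = pq.ParquetFile("many_row_groups.parquet").metadata

# Every row group contributes its own metadata entries, so a file with
# more than a million row groups pushes the serialized footer toward
# the 2 GiB Thrift message cap.
print(md.num_row_groups)   # number of row groups
print(md.serialized_size)  # size of the footer metadata in bytes
{code}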

> [C++] It is possible to overflow a TMemoryBuffer when serializing the file 
> metadata
> ---
>
> Key: PARQUET-1345
> URL: https://issues.apache.org/jira/browse/PARQUET-1345
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> I'm not sure if this is fixable, but see issue reported to Arrow:
> https://github.com/apache/arrow/issues/2077



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata

2020-10-01 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205420#comment-17205420
 ] 

Uwe Korn commented on PARQUET-1345:
---

One of the ways this can appear is when one has a pandas DataFrame with many 
categorical columns; the pandas metadata stored in the file footer may then 
become really huge.
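
To make this concrete, a small sketch (current pyarrow, made-up column names) 
that measures the pandas metadata blob which ends up in the file footer:

{code:python}
import pandas as pd
import pyarrow as pa

# Each categorical column adds an entry to the b'pandas' metadata that
# pyarrow attaches to the schema and serializes into the Parquet footer.
df = pd.DataFrame({f"c{i}": pd.Categorical(["a", "b"]) for i in range(1000)})
table = pa.Table.from_pandas(df)

# Size in bytes of the pandas metadata attached to the schema.
print(len(table.schema.metadata[b"pandas"]))
{code}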

> [C++] It is possible to overflow a TMemoryBuffer when serializing the file 
> metadata
> ---
>
> Key: PARQUET-1345
> URL: https://issues.apache.org/jira/browse/PARQUET-1345
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> I'm not sure if this is fixable, but see issue reported to Arrow:
> https://github.com/apache/arrow/issues/2077



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1825) [C++] Fix compilation error in column_io_benchmark.cc

2020-03-24 Thread Uwe Korn (Jira)
Uwe Korn created PARQUET-1825:
-

 Summary: [C++] Fix compilation error in column_io_benchmark.cc
 Key: PARQUET-1825
 URL: https://issues.apache.org/jira/browse/PARQUET-1825
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Uwe Korn
Assignee: Uwe Korn


Leftover of [https://github.com/apache/arrow/pull/6690]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029811#comment-17029811
 ] 

Uwe Korn commented on PARQUET-1783:
---

The problem is somewhere in the Parquet C++ code, as the statistics are 
computed there.
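
Until this is fixed, one possible workaround (an untested sketch, assuming the 
wrong statistics indeed come from the dictionary path) is to cast the 
dictionary column to plain strings before writing, so min/max are computed 
from the values actually present in each row group:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
table = pa.Table.from_pandas(test_df)

# Replace the dictionary-encoded column with a plain string column so
# the writer no longer sees the full dictionary when computing stats.
table = table.set_column(
    0, "categorical", table.column("categorical").cast(pa.string())
)
pq.write_table(table, "test_parquet", chunk_size=1)
{code}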

> [C++] Parquet statistics wrong for dictionary type
> --
>
> Key: PARQUET-1783
> URL: https://issues.apache.org/jira/browse/PARQUET-1783
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.6.0
>Reporter: Florian Jetter
>Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are equivalent for all row groups and refer 
> to the entire {{CategoricalDtype}} instead of the data included in the row 
> group.
> h3. Expected behaviour
> The row group statistics should only include data which is part of the actual 
> row group, not the entire {{CategoricalDtype}}
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
> table,
> "test_parquet",
> chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> 
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected would be
> {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group
>  
> Tested with 
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / 
> essentially 0.16.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Moved] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn moved ARROW-7732 to PARQUET-1783:
--

  Component/s: (was: C++)
   parquet-cpp
  Key: PARQUET-1783  (was: ARROW-7732)
Affects Version/s: (was: 0.15.1)
   (was: 0.16.0)
   cpp-1.6.0
 Workflow: patch-available, re-open possible  (was: jira)
  Project: Parquet  (was: Apache Arrow)

> [C++] Parquet statistics wrong for dictionary type
> --
>
> Key: PARQUET-1783
> URL: https://issues.apache.org/jira/browse/PARQUET-1783
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.6.0
>Reporter: Florian Jetter
>Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are equivalent for all row groups and refer 
> to the entire {{CategoricalDtype}} instead of the data included in the row 
> group.
> h3. Expected behaviour
> The row group statistics should only include data which is part of the actual 
> row group, not the entire {{CategoricalDtype}}
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
> table,
> "test_parquet",
> chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> 
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected would be
> {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group
>  
> Tested with 
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / 
> essentially 0.16.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1779) format: Update merge script

2020-01-27 Thread Uwe Korn (Jira)
Uwe Korn created PARQUET-1779:
-

 Summary: format: Update merge script
 Key: PARQUET-1779
 URL: https://issues.apache.org/jira/browse/PARQUET-1779
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Uwe Korn
Assignee: Uwe Korn
 Fix For: format-2.8.0


The current merge script is not Python 3 compatible. Copy over the merge 
script from the Arrow project, whose development originally started from 
merge_parquet.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1777) add Parquet logo vector files to repo

2020-01-27 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn resolved PARQUET-1777.
---
Fix Version/s: format-2.8.0
   Resolution: Fixed

Issue resolved by pull request 157
[https://github.com/apache/parquet-format/pull/157]

> add Parquet logo vector files to repo
> -
>
> Key: PARQUET-1777
> URL: https://issues.apache.org/jira/browse/PARQUET-1777
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1689) [C++] Stream API: Allow for columns/rows to be skipped when reading

2019-11-22 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn resolved PARQUET-1689.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 5797
[https://github.com/apache/arrow/pull/5797]

> [C++] Stream API: Allow for columns/rows to be skipped when reading
> ---
>
> Key: PARQUET-1689
> URL: https://issues.apache.org/jira/browse/PARQUET-1689
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Gawain BOLTON
>Assignee: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> It can be useful to be able to skip rows and/or columns when reading data.
> The ColumnReader class already allows for data to be skipped.
> This new StreamReader class could use this functionality to allow for users 
> to skip columns and rows when using the StreamReader API.
> I will propose this functionality by submitting a PR.
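
For comparison, column and row-group-level row skipping are already available 
from the Python-level reader; a sketch with pyarrow (file and column names are 
made up):

{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")

# Skip columns by reading only the ones of interest.
table = pf.read(columns=["id", "degree"])

# Skip rows at row-group granularity by reading a single row group.
first = pf.read_row_group(0, columns=["id"])
{code}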



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1686) Automate site generation

2019-10-30 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963045#comment-16963045
 ] 

Uwe Korn commented on PARQUET-1686:
---

In Arrow we are using Jekyll with Github Actions to automatically deploy our 
site: 
[https://github.com/apache/arrow-site/blob/master/.github/workflows/deploy.yml]

> Automate site generation
> 
>
> Key: PARQUET-1686
> URL: https://issues.apache.org/jira/browse/PARQUET-1686
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-site
>Reporter: Gabor Szadovszky
>Priority: Major
>  Labels: documentation
>
> We moved our site source to [github|https://github.com/apache/parquet-site]. 
> It is much better than svn but still not working as it should. Currently, we 
> have to generate the site manually before checking in. It would be much 
> better if the site generation were automatic so we could simply accept PRs 
> on the source files.
>  One option to achieve this is the [Pelican CMS 
> System|https://blog.getpelican.com/] as described at [.asf.yaml features for 
> git 
> repositories|https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories#id-.asf.yamlfeaturesforgitrepositories-StaticwebsitecontentgenerationviaPelicanCMS].
>  Not sure if this is the best solution, though. Another solution might be to 
> trigger a Jenkins build for the changes on master and, after generating the 
> site with middleman, commit the files to the asf-site branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Preparing for parquet-cpp 0.1

2016-11-08 Thread Uwe Korn
We already have https://issues.apache.org/jira/browse/PARQUET-713, 
closed as duplicate ;)


Especially the dev scripts seem to originate from somewhere else? Is there 
something we have to take care of because of parquet-cpp's origin?


Also, I made a PR to run RAT in the CI to check the licenses: 
https://github.com/apache/parquet-cpp/pull/189 It runs nicely, but we still 
have to deal with the things Ryan mentioned.


On 08.11.16 19:23, Julien Le Dem wrote:

I created a jira for the release:
https://issues.apache.org/jira/browse/PARQUET-774
please add blockers to that jira if they need to be in the release.


On Tue, Nov 8, 2016 at 10:07 AM, Ryan Blue <rb...@netflix.com.invalid>
wrote:


Do you guys intend to release convenience binaries in addition to the
initial source release? If so, I think you'll have to include a
license/notice that includes the third party dependencies.

Also, license should be used to record third-party licensed works that are
included in the source distribution. The bit packing code should be in
there, rather than in notice. Notice is for required third-party notices
and isn't the file where third-party licensing information should be
accumulated.

rb

On Tue, Nov 8, 2016 at 10:00 AM, Wes McKinney <wesmck...@gmail.com> wrote:


I think we are ready to make a release once PARQUET-702 is merged. Is
there any more licensing / NOTICE review work to do?

On Fri, Nov 4, 2016 at 10:29 AM, Deepak Majeti <majeti.dee...@gmail.com>
wrote:

I would like to get PARQUET-764 and PARQUET-702 into the release as
well. Both of them belong to me.
I plan to finish PARQUET-702 by Monday.
If someone can take over PARQUET-764, it will be easier.

On Fri, Nov 4, 2016 at 3:04 AM, Uwe Korn <uw...@xhochy.com> wrote:

Hello,

given that we have reached a point where parquet-cpp is working quite nicely
and a minimal set of features is implemented, I would like to continue to
make a release in the next days. I would wait for PARQUET-726 [1] to be
merged and then set up the release scripts and ask for a vote.

Is there anything else someone wants to get in before the initial release?

Uwe

[1] https://github.com/apache/parquet-cpp/pull/184




--
regards,
Deepak Majeti



--
Ryan Blue
Software Engineer
Netflix








Re: [VOTE] Release Apache Parquet 1.9.0 RC1

2016-10-14 Thread Uwe Korn

Hello Ryan,

sadly I have failing tests with the RC. It seems they are locale 
dependent ("," vs "."). Rerunning with LANG=en_US.UTF-8 sadly did not 
solve this; is there some other magic I need to provide to switch JVM 
locales?


% cat 
parquet-column/target/surefire-reports/org.apache.parquet.column.statistics.TestStatistics.txt

---
Test set: org.apache.parquet.column.statistics.TestStatistics
---
Tests run: 9, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.024 
sec <<< FAILURE!
testFloatMinMax(org.apache.parquet.column.statistics.TestStatistics) 
Time elapsed: 0.01 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected:num_nulls: 0> but was:

at org.junit.Assert.assertEquals(Assert.java:125)
at org.junit.Assert.assertEquals(Assert.java:147)
at 
org.apache.parquet.column.statistics.TestStatistics.testFloatMinMax(TestStatistics.java:235)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)

at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)

at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
at 
org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
at 
org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
at 
org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)
testDoubleMinMax(org.apache.parquet.column.statistics.TestStatistics) 
Time elapsed: 0 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected:num_nulls: 0> but was:

at org.junit.Assert.assertEquals(Assert.java:125)
at org.junit.Assert.assertEquals(Assert.java:147)
at 
org.apache.parquet.column.statistics.TestStatistics.testDoubleMinMax(TestStatistics.java:296)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)

at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)

at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at 

Re: [Draft report] Apache Parquet

2016-10-13 Thread Uwe Korn

+1


On 13.10.16 02:43, Julien Le Dem wrote:

Report from the Apache Parquet committee [Julien Le Dem]

## Description:
Parquet is a standard and interoperable columnar file format for
efficient analytics.

## Issues:
there are no issues requiring board attention at this time

## Activity:
The community has been converging toward a 1.9 release. The vote will start
in the coming days. Discussion about better encoding and vectorization apis
are ongoing.
The parquet-cpp repo has reached a stable state and should release soon.
Integration with arrow-cpp is now in the parquet-cpp repo.

## Health report:
The PMC and committer list are growing. Discussion is happening on the
mailing list, JIRA and regular hangout sync up. Notes are sent to the
mailing list.

## PMC changes:

  - Currently 22 PMC members.
  - Wes McKinney was added to the PMC on Thu Sep 01 2016

## Committer base changes:

  - Currently 25 committers.
  - Uwe Korn was added as a committer on Sun Sep 04 2016

## Releases:

  - Last release was Format 2.3.1 on Thu Dec 17 2015

## Mailing list activity:

  - Activity on the mailing list is still at relatively the same level
  - JIRAs are resolved at about the same pace as they are opened.

  - dev@parquet.apache.org:
 - 172 subscribers (up 9 in the last 3 months):
 - 486 emails sent to list (394 in previous quarter)


## JIRA activity:

  - 85 JIRA tickets created in the last 3 months
  - 74 JIRA tickets closed/resolved in the last 3 months





Re: Python Parquet package

2016-09-21 Thread Uwe Korn
Sounds reasonable to me. I will then continue to implement the missing 
interfaces for Parquet in pyarrow.parquet. 

@wesm Can you take care that we can easily depend on a pinned version of 
parquet-cpp in pyarrow’s travis builds?

Uwe

> Am 21.09.2016 um 20:07 schrieb Wes McKinney <wesmck...@gmail.com>:
> 
> I don't agree with this approach right now. Here are my reasons:
> 
> 1. The Parquet Python integration will need to depend both on PyArrow
> and the Arrow C++ libraries, so these libraries would generally need
> to be developed together
> 
> 2. PyArrow would need to define and maintain a C++ or Cython API so
> that the equivalent of the current pyarrow.parquet library can access
> C-level data. For example:
> 
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31
> 
> Cython does permit cross-project C API access (we are already doing
> cross-module Cython APi access within pyarrow). This adds additional
> complexity that I think we should avoid for now.
> 
> 3. Maintaining a separate C++ build toolchain for a Python package
> adds additional maintenance and packaging burden on us
> 
> My inclination is to keep the code where it is and make the Parquet
> extension optional.
> 
> - Wes
> 
> On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn <uw...@xhochy.com> wrote:
>> Hello,
>> 
>> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
>> still have to decide on how we are going to proceed with the Arrow<->Parquet
>> Python integration. For the moment, it seems that the best way to go ahead
>> is to pull the pyarrow.parquet module out into a separate Python package.
>> From an organisational point of view, I'm unclear how I should proceed here. Should
>> we put this in a separate repo? If so, as part of the Apache organisation?
>> 
>> Uwe



Python Parquet package

2016-09-21 Thread Uwe Korn

Hello,

as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, 
we still have to decide on how we are going to proceed with the 
Arrow<->Parquet Python integration. For the moment, it seems that the 
best way to go ahead is to pull the pyarrow.parquet module out into a 
separate Python package. From an organisational point of view, I'm unclear 
how I should proceed here. Should we put this in a separate repo? If so, as 
part of the Apache organisation?


Uwe


Re: Cannot load Parquet files created with parquet-cpp in Drill

2016-09-07 Thread Uwe Korn
Happy to report back that this is really a parquet-cpp issue and not 
something in Drill. Kudos to Deepak Majeti for finding that we did not 
set the dictionary_page_offset in the C++ code.
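
For reference, the offending field can nowadays be inspected from Python; a 
sketch with a current pyarrow (the filename is hypothetical):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("file.parquet").metadata
    col = md.row_group(0).column(0)

    # A dictionary-encoded column chunk whose dictionary page offset is
    # missing is exactly the kind of inconsistency that tripped Drill up.
    print(col.encodings)
    print(col.dictionary_page_offset)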


Uwe

On 07.09.16 21:08, Kunal Khatua wrote:

Hi Uwe

I believe you're using the latest Apache Drill 1.8.0. From a quick look at the 
stack trace, it appears to be a potential bug in Drill's interpretation of 
dictionary-encoded data.

One way to verify that your C++ implementation of Parquet is correct would be 
to generate your data without dictionary encoding before attempting to 
see if Drill can read that.
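
With a current pyarrow, the dictionary-free test file Kunal suggests could be 
produced like this (a sketch; the table contents are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

    # Disable dictionary encoding for all columns so Drill's handling
    # of dictionary pages is taken out of the equation.
    pq.write_table(table, "no_dict.parquet", use_dictionary=False)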

Regards
Kunal

On Wed 7-Sep-2016 5:30:32 AM, Uwe Korn <uw...@xhochy.com> wrote:
Hello,

I'm currently looking at the correctness of our C++ implementation of
Parquet and noticed that I cannot load these files in Drill. Although
this is probably a bug in the C++ implementation, I don't understand
what causes the error. Using the Java parquet-tools, I can read these
files. I'm using Apache Drill 1.8.0 on OSX.

I've posted the error output from Drill and the parquet file as a gist:
https://gist.github.com/xhochy/d4441a5ff2025b877df43fecd4466a11

If anyone could have a short look into this and tell me why Drill cannot
read the file, you would really help me to fix the parquet-cpp issues.

Kind Regards,

Uwe






Cannot load Parquet files created with parquet-cpp in Drill

2016-09-07 Thread Uwe Korn

Hello,

I'm currently looking at the correctness of our C++ implementation of 
Parquet and noticed that I cannot load these files in Drill. Although 
this is probably a bug in the C++ implementation, I don't understand 
what causes the error. Using the Java parquet-tools, I can read these 
files. I'm using Apache Drill 1.8.0 on OSX.


I've posted the error output from Drill and the parquet file as a gist: 
https://gist.github.com/xhochy/d4441a5ff2025b877df43fecd4466a11


If anyone could have a short look into this and tell me why Drill cannot 
read the file, you would really help me to fix the parquet-cpp issues.


Kind Regards,

Uwe



Re: Arrow-Parquet integration location (Was: Arrow cpp travis-ci build broken)

2016-09-06 Thread Uwe Korn

Hello,

I'm also in favour of switching the dependency direction between Parquet 
and Arrow, as this would avoid a lot of duplicate code in both projects 
and would let parquet-cpp profit from functionality that is already 
available in Arrow.


@wesm: go ahead with the JIRAs and I'll add comments or will pick some 
of them up.


Cheers

Uwe


On 07.09.16 04:41, Wes McKinney wrote:

hi Julien,

It makes sense to move the Parquet support for Arrow into Parquet
itself and invert the dependency. I had thought that the coupling to
Arrow C++'s IO subsystem might be tighter, but the connection between
memory allocators and file abstractions is fairly simple:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h

I'll open appropriate JIRAs and Uwe and I can coordinate on the refactoring.

The exposure of the Parquet functionality in Python should stay inside
Arrow for now, but mainly because it would make developing the Python
side of things much more difficult if we split things up right now.

- Wes

On Tue, Sep 6, 2016 at 8:27 PM, Brian Bowman  wrote:

Forgive me if interposing my first post for the Apache Arrow project on this 
thread is incorrect procedure.

What Julien proposes with each storage layer producing Arrow Record Batches is 
exactly how I envision it working and would certainly make Arrow integration 
with SAS much more palatable.  This is likely true for other storage layer 
providers as well.

Brian Bowman (SAS)


On Sep 6, 2016, at 7:52 PM, Julien Le Dem  wrote:

Thanks Wes,
No worries, I know you are on top of those things.
On a side note, I was wondering if the arrow-parquet integration should be
in Parquet instead.
Parquet would depend on Arrow and not the other way around.
Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra,
...) provides a way to produce Arrow Record Batches.
thoughts?


On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney  wrote:

hi Julien,

I'm very sorry about the inconvenience with this and the delay in
getting it sorted out. I will triage this evening by disabling the
Parquet tests in Arrow until we get the current problems under
control. When we re-enable the Parquet tests in Travis CI I agree we
should pin the version SHA.

- Wes


On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem  wrote:
The Arrow cpp travis-ci build is broken right now because it depends on
parquet-cpp which has changed in an incompatible way. [1] [2] (or so it
looks to me)
Since parquet-cpp is not released yet it is totally fine to make
incompatible API changes.
However, we may want to pin the Arrow to Parquet dependency (on a git sha?)
to prevent cross project changes from breaking the master build.
Since I'm not one of the core cpp dev on those projects I mainly want to
start that conversation rather than prescribe a solution. Feel free to take
this as a straw man and suggest something else.

[1] https://travis-ci.org/apache/arrow/jobs/156080555
[2] https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d5af150dd31/ci/travis_before_script_cpp.sh


--
Julien



--
Julien




Re: Reviving Parquet sync ups

2016-09-01 Thread Uwe Korn
+1 for a sync up and for the European-friendly time. I should be able to 
join this time.



On 01.09.16 08:02, Julien Le Dem wrote:

Hi Piyush,
You are totally right. Sync ups are an important part of keeping the community 
informed and making progress.
I'll schedule one for next week.
Thursday 10 am PT?

Julien


On Aug 31, 2016, at 18:54, Piyush Narang  wrote:

hi folks,

A few months back we used have Parquet community sync ups via hangouts
which were a nice opportunity to chat with other Parquet developers and
discuss major / minor agenda items (e.g. 1.9.0 release / Parquet 2.0 etc)
and things folks were working on. As it has been a while since the last
sync up, I was wondering if there would there be interest in reviving this?

Thanks,

--
- Piyush




Re: Parquet Vectorized Read hackathon

2016-07-06 Thread Uwe Korn

Yes, I'm GMT +1


On 05.07.16 18:52, Julien Le Dem wrote:

If there are people interested in the cpp implementation we’ll talk about that 
too.
I’m happy to give context or help with the encoding. In particular a Parquet -> 
Arrow vectorized converter would be great.
Are you GMT +1 ?
We can schedule a 1 hour slot in the morning for discussing with remote folks 
in Europe. (same in afternoon if there are people joining from Asia)
Julien


On Jul 5, 2016, at 2:37 AM, Uwe Korn <uw...@xhochy.com> wrote:

Hello,

Is this effort only for the parquet-mr project or would there also be some 
work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due 
to the timezone shift, I probably will not be able to be awake all the time.

Uwe

On 02.07.16 01:01, Julien Le Dem wrote:

Dear Parquet dev list,
There have been efforts in several projects for vectorized reads of Parquet.
We had discussed during the Parquet sync up to organize a hackathon to
brainstorm and look into a shared implementation.
Some projects that would benefit:
  - Apache Drill
  - Apache Arrow
  - Apache Spark
  - Presto
  - Apache Hive

I'm planning to organize this at the Dremio office in Mountain View with
optionally a hangout for people who would want to join remotely.
I'm adding to the "to:" people that have expressed interest or could be
interested but that's not an exhaustive list. Please respond to this email
if you wish to be included.
Who's interested and what dates would work between this Tuesday 7/5 and
Wednesday 7/20 ?





Re: Parquet Vectorized Read hackathon

2016-07-06 Thread Uwe Korn
7/12 and 7/14 are OK for me. I'm mainly interested in the 
parquet-cpp -> Arrow C++ -> PyArrow path for now. Encodings other than plain 
encoding are on my near-future roadmap.



On 05.07.16 19:00, Julien Le Dem wrote:

7/14 works better for me.
For now we have for 7/14:

- OK for 7/14: Jacques, Ryan, Julien
- Please confirm the date (and time): Deepak, Cheng, Uwe

Please send a short description of the projects you’re working on and what your 
particular interest is.




On Jul 5, 2016, at 9:50 AM, Ryan Blue <rb...@netflix.com.INVALID> wrote:

I'm in, and both 7/12 and 7/14 work for me.

rb

On Tue, Jul 5, 2016 at 9:15 AM, Jacques Nadeau <jacq...@apache.org> wrote:


Great idea, Julien!

I vote for 7/12 or 7/14

On Tue, Jul 5, 2016 at 2:37 AM, Uwe Korn <uw...@xhochy.com> wrote:


Hello,

Is this effort only for the parquet-mr project or would there also be some
work/benefit for parquet-cpp? If so, I might join briefly in a hangout but
due to the timezone shift, I probably will not be able to be awake all the
time.

Uwe


On 02.07.16 01:01, Julien Le Dem wrote:


Dear Parquet dev list,
There have been efforts in several projects for vectorized reads of
Parquet.
We had discussed during the Parquet sync up to organize a hackathon to
brainstorm and look into a shared implementation.
Some projects that would benefit:
  - Apache Drill
  - Apache Arrow
  - Apache Spark
  - Presto
  - Apache Hive

I'm planning to organize this at the Dremio office in Mountain View with
optionally a hangout for people who would want to join remotely.
I'm adding to the "to:" people that have expressed interest or could be
interested but that's not an exhaustive list. Please respond to this email
if you wish to be included.
Who's interested and what dates would work between this Tuesday 7/5 and
Wednesday 7/20 ?





--
Ryan Blue
Software Engineer
Netflix




Re: Parquet Vectorized Read hackathon

2016-07-05 Thread Uwe Korn

Hello,

Is this effort only for the parquet-mr project or would there also be 
some work/benefit for parquet-cpp? If so, I might join briefly in a 
hangout but due to the timezone shift, I probably will not be able to be 
awake all the time.


Uwe

On 02.07.16 01:01, Julien Le Dem wrote:

Dear Parquet dev list,
There have been efforts in several projects for vectorized reads of Parquet.
We had discussed during the Parquet sync up to organize a hackathon to
brainstorm and look into a shared implementation.
Some projects that would benefit:
  - Apache Drill
  - Apache Arrow
  - Apache Spark
  - Presto
  - Apache Hive

I'm planning to organize this at the Dremio office in Mountain View with
optionally a hangout for people who would want to join remotely.
I'm adding to the "to:" people that have expressed interest or could be
interested but that's not an exhaustive list. Please respond to this email
if you wish to be included.
Who's interested and what dates would work between this Tuesday 7/5 and
Wednesday 7/20 ?





List of Additions to Parquet 2

2016-06-16 Thread Uwe Korn

Hello,

I'm currently looking at the differences between Parquet 1 and Parquet 2 
in order to implement these versions as a switch in parquet-cpp. The only 
list I could find is the rather undetailed changelog [1]. Is there maybe a 
better list, or do I need to go through the referenced changesets myself to 
find the actual differences? (If the latter is the case, I'd also make a PR 
afterwards that augments the documentation with some "(since version 2.0)" 
markings.) But I'm hoping a bit that there is some blog post or similar out 
there that could make my life easier.
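
As an aside, such a switch eventually surfaced as a writer property; a sketch 
of how it looks in a current pyarrow (the data is made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})

    # "1.0" restricts the writer to Parquet 1.x types and encodings;
    # newer version values enable the format 2.x additions.
    pq.write_table(table, "v1.parquet", version="1.0")
    pq.write_table(table, "v2.parquet", version="2.6")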


Thanks,

Uwe

[1] https://github.com/apache/parquet-format/blob/master/CHANGES.md



Re: Parquet sync up

2016-05-12 Thread Uwe Korn



I'm sorry I wasn't able to join today again (traveling). We could
choose an early time Pacific time to make the meeting accessible to
both Asia and Europe -- I would suggest 8 or 9 AM Pacific


8 or 9 am PT would work for me (CEST), 4pm PT is just not manageable.
Also: Do we have a calendar where I can see in advance when sync ups are?

Currently I'm working on the Parquet integration with Arrow and on 
building a Python interface for libarrow-parquet. Once we have a basic 
working version, I will look into implementing missing features in the 
writer and improving general read/write performance in parquet-cpp.


Uwe

http://timesched.pocoo.org/?date=2016-05-11=pacific-standard-time!,de:berlin,cn:shanghai,us:new-york-city:ny

I did not have much time for Parquet C++ development in the last
6 weeks, but plan to help Uwe complete the writer implementation and
work toward a more complete Apache Arrow integration (this is in
progress here: 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/parquet)

Other items of immediate interest

- C++ API to the file metadata (read + write)
- Conda packaging for built artifacts (to make parquet-cpp easier for
Python programmers to install portably when the time comes). I got
Thrift C++ into conda-forge this week so this should not be hard now
https://github.com/conda-forge/thrift-cpp-feedstock
- Expanding column scan benchmarks (thanks Uwe for kickstarting the
benchmarking effort!)
- Perf improvements for the RLE decoder

Thanks
Wes

On Wed, May 11, 2016 at 4:04 PM, Julien Le Dem  wrote:

The actual hangout url is
https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up

On Wed, May 11, 2016 at 3:57 PM, Julien Le Dem  wrote:


starting in 5 mins:
https://plus.google.com/hangouts/_/event/parquet_sync_up

On Wed, May 11, 2016 at 1:53 PM, Julien Le Dem  wrote:


It is happening at 4pm PT on google hangout
https://plus.google.com/hangouts/_/event/parquet_sync_up

(we can do a different time next time, based on timezone preferences.
Afternoon is better for Asia. Morning is better for Europe)

--
Julien




--
Julien




--
Julien




Re: Parquet sync up

2016-04-21 Thread Uwe Korn

Hello,

due to me being in Europe, this is a very inconvenient time. Thus I'd 
rather write a longer mail instead of joining. As a bit of input, here 
is what I'm up to at the moment:


 * Write support in a basic form for parquet-cpp (no compression, fixed 
encodings, excessive memory usage, ..) is nearly done. I hope to open 
the final PR for discussion next week.

 * Remaining Tasks until I make the PR:
   * a bit of code cleanup
   * Going through the API again to make it consistent
   * Metadata for RowGroups and ColumnChunks

Afterwards I would look into one of the following tasks w.r.t. parquet-cpp:
 * WriterProperties to specify compression, encoding, .. on a global 
and per-column basis.

 * Performance benchmarks for Write
 * Integration of Parquet support in Apache Arrow to use it with Python
 * Reduce the memory usage of the initial Writer implementation 
(therefore we probably need to extend the encoders a bit)


If anyone else also looks into this, I'm happy to collaborate ;)

Cheers
Uwe

On 21.04.16 00:51, Julien Le Dem wrote:

It is happening at 4pm PT on google hangout
https://plus.google.com/hangouts/_/event/parquet_sync_up






C++: API Documentation Style/Tool

2016-04-20 Thread Uwe Korn

Hello,

I would like to start adding some API documentation comments to the 
parquet-cpp code I'm currently working on. By default, I would use doxygen 
and doxygen-style comments for the API. Are there any other suggestions or 
best practices you would prefer?


Greetings
Uwe


Retrieving the full/expanded name of a column in parquet-cpp

2016-03-19 Thread Uwe Korn

Hello,

While using parquet-cpp, I'm trying to figure out how to reliably 
determine the index of a named/nested column. In my example, I have a 
nested column "neighbours.array" but may later also add some more columns 
named "??.array".


Until now I used "column->descr()->name()" inside a loop over all 
columns in a RowGroup to determine if the current column is the one I 
want to read. This works fine for "top-level" columns but for 
neighbours.array, this only returns "array", the name of the primitive 
node in the schema description.


To solve my problem:

1. Do we already have a reliable solution to determine which column
   index "neighbours.array" is?
2. We could add a fullname (or differently named) function to the
   column description.
3. We could have a map on Reader or RowGroup level that maps expanded
   name to index.

If there is no solution yet, I'd be happy to implement 2 or 3 (or an 
alternative approach).


My schema is as follows (generated via ParquetAvroWriter):

   required group com.xhochy.AdjacencyArray {
  required int32 id
  required int32 degree
  required group neighbours {
repeated int32 array
  }
   }
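
For illustration, option 3 at the Python level: a sketch with a current 
pyarrow that builds the expanded-name-to-index map for the schema above (the 
filename is hypothetical):

    import pyarrow.parquet as pq

    schema = pq.ParquetFile("adjacency.parquet").schema

    # Map the full dotted path of each leaf column to its index, so
    # "neighbours.array" is distinguishable from any other "*.array".
    path_to_index = {schema.column(i).path: i
                     for i in range(len(schema.names))}
    print(path_to_index["neighbours.array"])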

Greetings,
Uwe