[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata
[ https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208158#comment-17208158 ] Uwe Korn commented on PARQUET-1345: --- Turns out this was not due to many categorical columns but due to a huge number (>1mio) of RowGroups. We cannot fix this as Thrift messages are capped at 2GiB but we could probably raise a better error message. > [C++] It is possible to overflow a TMemoryBuffer when serializing the file > metadata > --- > > Key: PARQUET-1345 > URL: https://issues.apache.org/jira/browse/PARQUET-1345 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Major > > I'm not sure if this is fixable, but see issue reported to Arrow: > https://github.com/apache/arrow/issues/2077 -- This message was sent by Atlassian Jira (v8.3.4#803005)
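The "better error message" idea above could be sketched as a pre-flight check before the footer is serialized. This is only an illustration in plain Python, not parquet-cpp code: the function names are hypothetical, the 2 GiB limit is taken from the comment above, and the per-column-chunk metadata size of 200 bytes is an arbitrary assumption that in practice depends on statistics, path lengths and encodings.

```python
# Hypothetical sketch: estimate the serialized footer size from the number of
# row groups and fail early with a descriptive message.
THRIFT_MAX_MESSAGE_BYTES = 2 * 1024**3  # 2 GiB cap mentioned in the comment


def count_row_groups(num_rows, row_group_size):
    """Number of row groups needed to hold num_rows (ceiling division)."""
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    return -(-num_rows // row_group_size)


def estimate_footer_bytes(num_row_groups, num_columns, bytes_per_column_chunk=200):
    """Rough footer size; bytes_per_column_chunk is an assumed average."""
    return num_row_groups * num_columns * bytes_per_column_chunk


def check_footer_fits(num_rows, row_group_size, num_columns):
    """Raise a descriptive error if the estimated footer exceeds the cap."""
    groups = count_row_groups(num_rows, row_group_size)
    estimate = estimate_footer_bytes(groups, num_columns)
    if estimate > THRIFT_MAX_MESSAGE_BYTES:
        raise ValueError(
            f"file metadata for {groups} row groups x {num_columns} columns "
            f"is estimated at {estimate} bytes, above the 2 GiB Thrift limit; "
            "consider a larger row group size"
        )
    return estimate
```

Under these assumptions, two million single-row row groups with ten columns would already be rejected, while the same data written with 1000 rows per row group would pass.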
[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata
[ https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205420#comment-17205420 ] Uwe Korn commented on PARQUET-1345: --- One of the reasons this could appear is in the case that one has a pandas DataFrame with many categorical columns. Then the pandas metadata may become really huge. > [C++] It is possible to overflow a TMemoryBuffer when serializing the file > metadata > --- > > Key: PARQUET-1345 > URL: https://issues.apache.org/jira/browse/PARQUET-1345 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Major > > I'm not sure if this is fixable, but see issue reported to Arrow: > https://github.com/apache/arrow/issues/2077 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1825) [C++] Fix compilation error in column_io_benchmark.cc
Uwe Korn created PARQUET-1825: - Summary: [C++] Fix compilation error in column_io_benchmark.cc Key: PARQUET-1825 URL: https://issues.apache.org/jira/browse/PARQUET-1825 Project: Parquet Issue Type: Improvement Components: parquet-cpp Reporter: Uwe Korn Assignee: Uwe Korn Leftover of [https://github.com/apache/arrow/pull/6690] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029811#comment-17029811 ] Uwe Korn commented on PARQUET-1783: --- The problem is somewhere in the Parquet C++ code, as the statistics are computed there. > [C++] Parquet statistics wrong for dictionary type > -- > > Key: PARQUET-1783 > URL: https://issues.apache.org/jira/browse/PARQUET-1783 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Affects Versions: cpp-1.6.0 >Reporter: Florian Jetter >Priority: Major > > h3. Observed behaviour > Statistics for categorical data are equivalent for all row groups and refer > to the entire {{CategoricalDtype}} instead of the data included in the row > group. > h3. Expected behaviour > The row group statistics should only include data which is part of the actual > row group, not the entire {{CategoricalDtype}} > h3. Minimal example > {code:python} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])}) > table = pa.Table.from_pandas(test_df) > pq.write_table( > table, > "test_parquet", > chunk_size=1, > ) > test_parquet = pq.ParquetFile("test_parquet") > test_parquet.metadata.row_group(0).column(0).statistics > {code} > {code:java} > Out[1]: > > has_min_max: True > min: 1 > max: 42 > null_count: 0 > distinct_count: 0 > num_values: 1 > physical_type: BYTE_ARRAY > logical_type: String > converted_type (legacy): UTF8 > {code} > Expected would be > {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group > > Tested with > pandas==1.0.0 > pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / > essentially 0.16.0) -- This message was sent by Atlassian Jira (v8.3.4#803005)
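The difference between the observed and the expected statistics in this report can be illustrated without pyarrow. The sketch below is plain Python with hypothetical function names; it does not reproduce how parquet-cpp computes statistics, it only contrasts "min/max over the whole dictionary" with "min/max over the values in each row group" for the two-value example from the issue.

```python
# Hypothetical illustration of the two behaviours described in the issue.

def row_group_min_max(values, row_group_size):
    """Expected: min/max computed only from the values in each row group."""
    stats = []
    for start in range(0, len(values), row_group_size):
        chunk = values[start:start + row_group_size]
        stats.append((min(chunk), max(chunk)))
    return stats


def dictionary_min_max(values, row_group_size):
    """Observed: every row group reports min/max of the full dictionary."""
    categories = sorted(set(values))
    num_groups = -(-len(values) // row_group_size)  # ceiling division
    return [(categories[0], categories[-1])] * num_groups


values = ["1", "42"]  # same data as the minimal example, one row per group
expected = row_group_min_max(values, 1)
observed = dictionary_min_max(values, 1)
```

With this input, `expected` holds a separate pair per row group, whereas `observed` repeats the dictionary-wide pair for both groups, which matches the `min: 1`, `max: 42` output quoted in the report.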
[jira] [Moved] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn moved ARROW-7732 to PARQUET-1783: -- Component/s: (was: C++) parquet-cpp Key: PARQUET-1783 (was: ARROW-7732) Affects Version/s: (was: 0.15.1) (was: 0.16.0) cpp-1.6.0 Workflow: patch-available, re-open possible (was: jira) Project: Parquet (was: Apache Arrow) > [C++] Parquet statistics wrong for dictionary type > -- > > Key: PARQUET-1783 > URL: https://issues.apache.org/jira/browse/PARQUET-1783 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Affects Versions: cpp-1.6.0 >Reporter: Florian Jetter >Priority: Major > > h3. Observed behaviour > Statistics for categorical data are equivalent for all row groups and refer > to the entire {{CategoricalDtype}} instead of the data included in the row > group. > h3. Expected behaviour > The row group statistics should only include data which is part of the actual > row group, not the entire {{CategoricalDtype}} > h3. Minimal example > {code:python} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])}) > table = pa.Table.from_pandas(test_df) > pq.write_table( > table, > "test_parquet", > chunk_size=1, > ) > test_parquet = pq.ParquetFile("test_parquet") > test_parquet.metadata.row_group(0).column(0).statistics > {code} > {code:java} > Out[1]: > > has_min_max: True > min: 1 > max: 42 > null_count: 0 > distinct_count: 0 > num_values: 1 > physical_type: BYTE_ARRAY > logical_type: String > converted_type (legacy): UTF8 > {code} > Expected would be > {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group > > Tested with > pandas==1.0.0 > pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / > essentially 0.16.0) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1779) format: Update merge script
Uwe Korn created PARQUET-1779: - Summary: format: Update merge script Key: PARQUET-1779 URL: https://issues.apache.org/jira/browse/PARQUET-1779 Project: Parquet Issue Type: Improvement Components: parquet-format Reporter: Uwe Korn Assignee: Uwe Korn Fix For: format-2.8.0 The current merge script is Python 3 incompatible, copy over the merge_script from the Arrow project which is a development that initially started from merge_parquet.py -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1777) add Parquet logo vector files to repo
[ https://issues.apache.org/jira/browse/PARQUET-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved PARQUET-1777. --- Fix Version/s: format-2.8.0 Resolution: Fixed Issue resolved by pull request 157 [https://github.com/apache/parquet-format/pull/157] > add Parquet logo vector files to repo > - > > Key: PARQUET-1777 > URL: https://issues.apache.org/jira/browse/PARQUET-1777 > Project: Parquet > Issue Type: Task > Components: parquet-format >Reporter: Julien Le Dem >Assignee: Julien Le Dem >Priority: Major > Labels: pull-request-available > Fix For: format-2.8.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1689) [C++] Stream API: Allow for columns/rows to be skipped when reading
[ https://issues.apache.org/jira/browse/PARQUET-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved PARQUET-1689. --- Fix Version/s: cpp-1.6.0 Resolution: Fixed Issue resolved by pull request 5797 [https://github.com/apache/arrow/pull/5797] > [C++] Stream API: Allow for columns/rows to be skipped when reading > --- > > Key: PARQUET-1689 > URL: https://issues.apache.org/jira/browse/PARQUET-1689 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Gawain BOLTON >Assignee: Gawain BOLTON >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.6.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > It can be useful to be able to skip rows and/or columns when reading data. > The ColumnReader class already allows for data to be skipped. > This new StreamReader class could use this functionality to allow for users > to skip columns and rows when using the StreamReader API. > I will propose this functionality by submitting a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1686) Automate site generation
[ https://issues.apache.org/jira/browse/PARQUET-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963045#comment-16963045 ] Uwe Korn commented on PARQUET-1686: --- In Arrow we are using Jekyll with Github Actions to automatically deploy our site: [https://github.com/apache/arrow-site/blob/master/.github/workflows/deploy.yml] > Automate site generation > > > Key: PARQUET-1686 > URL: https://issues.apache.org/jira/browse/PARQUET-1686 > Project: Parquet > Issue Type: Improvement > Components: parquet-site >Reporter: Gabor Szadovszky >Priority: Major > Labels: documentation > > We moved our site source to [github|https://github.com/apache/parquet-site]. > It is much better than svn but still not working as it should. Currently, we > have to generate the site manually before checking in. It would be much > better if the site generation would be automatic so we can simply accept PRs > on the source files. > One option to achieve this is the [Pelican CMS > System|https://blog.getpelican.com/] as described at [.asf.yaml features for > git > repositories|https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories#id-.asf.yamlfeaturesforgitrepositories-StaticwebsitecontentgenerationviaPelicanCMS]. > Not sure if this is the best solution though. Another solution might be to > trigger a jenkins build for the changes on master and after generating the > site with middleman commit the files to the branch asf-site. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Preparing for parquet-cpp 0.1
We already have https://issues.apache.org/jira/browse/PARQUET-713, closed as duplicate ;) Especially the dev scripts seem to originate from somewhere else? Is there something we have to take care of because of parquet-cpp's origin? Also I made a PR to run RAT in the CI to check the licenses: https://github.com/apache/parquet-cpp/pull/189 Runs nicely but we still have to deal with the things Ryan mentioned. On 08.11.16 19:23, Julien Le Dem wrote: I created a jira for the release: https://issues.apache.org/jira/browse/PARQUET-774 please add blockers to that jira if they need to be in the release. On Tue, Nov 8, 2016 at 10:07 AM, Ryan Blue <rb...@netflix.com.invalid> wrote: Do you guys intend to release convenience binaries in addition to the initial source release? If so, I think you'll have to include a license/notice that includes the third party dependencies. Also, license should be used to record third-party licensed works that are included in the source distribution. The bit packing code should be in there, rather than in notice. Notice is for required third-party notices and isn't the file where third-party licensing information should be accumulated. rb On Tue, Nov 8, 2016 at 10:00 AM, Wes McKinney <wesmck...@gmail.com> wrote: I think we are ready to make a release once PARQUET-702 is merged. Is there any more licensing / NOTICE review work to do? On Fri, Nov 4, 2016 at 10:29 AM, Deepak Majeti <majeti.dee...@gmail.com> wrote: I would like to get PARQUET-764 and PARQUET-702 into the release as well. Both of them belong to me. I plan to finish PARQUET-702 by Monday. If someone can take over PARQUET-764, it will be easier. On Fri, Nov 4, 2016 at 3:04 AM, Uwe Korn <uw...@xhochy.com> wrote: Hello, given that we have reached a point where parquet-cpp is working quite nicely and a minimal set of features is implemented, I would like to continue to make a release in the next days. 
I would wait for PARQUET-726 [1] to be merged and then setup the release scripts and ask for a vote. Is there anything else someone wants to get in before the initial release? Uwe [1] https://github.com/apache/parquet-cpp/pull/184 -- regards, Deepak Majeti -- Ryan Blue Software Engineer Netflix
Re: [VOTE] Release Apache Parquet 1.9.0 RC1
Hello Ryan, sadly I have failing tests with the RC. Seems like they are locale dependent ("," vs "."). Rerunning with LANG=en_US.UTF-8 sadly did not solve this, is there some other magic I need to provide to switch the JVM locale? % cat parquet-column/target/surefire-reports/org.apache.parquet.column.statistics.TestStatistics.txt --- Test set: org.apache.parquet.column.statistics.TestStatistics --- Tests run: 9, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.024 sec <<< FAILURE! testFloatMinMax(org.apache.parquet.column.statistics.TestStatistics) Time elapsed: 0.01 sec <<< FAILURE! org.junit.ComparisonFailure: expected:num_nulls: 0> but was: at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.parquet.column.statistics.TestStatistics.testFloatMinMax(TestStatistics.java:235) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164) at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110) at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175) at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68) testDoubleMinMax(org.apache.parquet.column.statistics.TestStatistics) Time elapsed: 0 sec <<< FAILURE! 
org.junit.ComparisonFailure: expected:num_nulls: 0> but was: at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.parquet.column.statistics.TestStatistics.testDoubleMinMax(TestStatistics.java:296) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at
Re: [Draft report] Apache Parquet
+1 On 13.10.16 02:43, Julien Le Dem wrote: Report from the Apache Parquet committee [Julien Le Dem] ## Description: Parquet is a standard and interoperable columnar file format for efficient analytics. ## Issues: there are no issues requiring board attention at this time ## Activity: The community has been converging toward a 1.9 release. The vote will start in the coming days. Discussion about better encoding and vectorization apis are ongoing. The parquet-cpp repo has reached a stable state and should release soon. Integration with arrow-cpp is now in the parquet-cpp repo. ## Health report: The PMC and committer list are growing. Discussion is happening on the mailing list, JIRA and regular hangout sync up. Notes are sent to the mailing list. ## PMC changes: - Currently 22 PMC members. - Wes McKinney was added to the PMC on Thu Sep 01 2016 ## Committer base changes: - Currently 25 committers. - Uwe Korn was added as a committer on Sun Sep 04 2016 ## Releases: - Last release was Format 2.3.1 on Thu Dec 17 2015 ## Mailing list activity: - Activity on the mailing list is still relatively the same - JIRAS are resolved about at the same pace they are opened. - dev@parquet.apache.org: - 172 subscribers (up 9 in the last 3 months): - 486 emails sent to list (394 in previous quarter) ## JIRA activity: - 85 JIRA tickets created in the last 3 months - 74 JIRA tickets closed/resolved in the last 3 months
Re: Python Parquet package
Sounds reasonable to me. I will then continue to implement the missing interfaces for Parquet in pyarrow.parquet. @wesm Can you make sure that we can easily depend on a pinned version of parquet-cpp in pyarrow’s travis builds? Uwe > Am 21.09.2016 um 20:07 schrieb Wes McKinney <wesmck...@gmail.com>: > > I don't agree with this approach right now. Here are my reasons: > > 1. The Parquet Python integration will need to depend both on PyArrow > and the Arrow C++ libraries, so these libraries would generally need > to be developed together > > 2. PyArrow would need to define and maintain a C++ or Cython API so > that the equivalent of the current pyarrow.parquet library can access > C-level data. For example: > > https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31 > > Cython does permit cross-project C API access (we are already doing > cross-module Cython API access within pyarrow). This adds additional > complexity that I think we should avoid for now. > > 3. Maintaining a separate C++ build toolchain for a Python package > adds additional maintenance and packaging burden on us > > My inclination is to keep the code where it is and make the Parquet > extension optional. > > - Wes > > On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn <uw...@xhochy.com> wrote: >> Hello, >> >> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we >> still have to decide on how we are going to proceed with the Arrow<->Parquet >> Python integration. For the moment, it seems that the best way to go ahead >> is to pull the pyarrow.parquet module out into a separate Python package. >> From an organisational point, I'm unclear how I should proceed here. Should >> we put this in a separate repo? If so, as part of the Apache organisation? >> >> Uwe
Python Parquet package
Hello, as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we still have to decide on how we are going to proceed with the Arrow<->Parquet Python integration. For the moment, it seems that the best way to go ahead is to pull the pyarrow.parquet module out into a separate Python package. From an organisational point, I'm unclear how I should proceed here. Should we put this in a separate repo? If so, as part of the Apache organisation? Uwe
Re: Cannot load Parquet files created with parquet-cpp in Drill
Happy to report back that this is really a parquet-cpp issue and not something in Drill. Kudos to Deepak Majeti for finding that we did not set the dictionary_page_offset in the C++ code. Uwe On 07.09.16 21:08, Kunal Khatua wrote: Hi Uwe I believe you're using the latest Apache Drill 1.8.0. From a quick look at the stack trace, it appears to be a potential bug in Drill's interpretation of dictionary encoded data. One way to verify that your C++ implementation of Parquet is correct would be to generate your data without dictionary encoding and then see if Drill can read that. Regards Kunal On Wed 7-Sep-2016 5:30:32 AM, Uwe Korn <uw...@xhochy.com> wrote: Hello, I'm currently looking at the correctness of our C++ implementation of Parquet and noticed that I cannot load these files in Drill. Although this is probably a bug in the C++ implementation, I don't understand what causes the error. Using the Java parquet-tools, I can read these files. I'm using Apache Drill 1.8.0 on OSX. I've posted the error output from Drill and the parquet file as a gist: https://gist.github.com/xhochy/d4441a5ff2025b877df43fecd4466a11 If anyone could have a short look into this and tell me why Drill cannot read the file, you would really help me to fix the parquet-cpp issues. Kind Regards, Uwe
Cannot load Parquet files created with parquet-cpp in Drill
Hello, I'm currently looking at the correctness of our C++ implementation of Parquet and noticed that I cannot load these files in Drill. Although this is probably a bug in the C++ implementation, I don't understand what causes the error. Using the Java parquet-tools, I can read these files. I'm using Apache Drill 1.8.0 on OSX. I've posted the error output from Drill and the parquet file as a gist: https://gist.github.com/xhochy/d4441a5ff2025b877df43fecd4466a11 If anyone could have a short look into this and tell me why Drill cannot read the file, you would really help me to fix the parquet-cpp issues. Kind Regards, Uwe
Re: Arrow-Parquet integration location (Was: Arrow cpp travis-ci build broken)
Hello, I'm also in favour of switching the dependency direction between Parquet and Arrow as this would avoid a lot of duplicate code in both projects as well as parquet-cpp profiting from functionality that is available in Arrow. @wesm: go ahead with the JIRAs and I'll add comments or will pick some of them up. Cheers Uwe On 07.09.16 04:41, Wes McKinney wrote: hi Julien, It makes sense to move the Parquet support for Arrow into Parquet itself and invert the dependency. I had thought that the coupling to Arrow C++'s IO subsystem might be tighter, but the connection between memory allocators and file abstractions is fairly simple: https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h I'll open appropriate JIRAs and Uwe and I can coordinate on the refactoring. The exposure of the Parquet functionality in Python should stay inside Arrow for now, but mainly because it would make developing the Python side of things much more difficult if we split things up right now. - Wes On Tue, Sep 6, 2016 at 8:27 PM, Brian Bowman wrote: Forgive me if interposing my first post for the Apache Arrow project on this thread is incorrect procedure. What Julien proposes with each storage layer producing Arrow Record Batches is exactly how I envision it working and would certainly make Arrow integration with SAS much more palatable. This is likely true for other storage layer providers as well. Brian Bowman (SAS) On Sep 6, 2016, at 7:52 PM, Julien Le Dem wrote: Thanks Wes, No worries, I know you are on top of those things. On a side note, I was wondering if the arrow-parquet integration should be in Parquet instead. Parquet would depend on Arrow and not the other way around. Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra, ...) provides a way to produce Arrow Record Batches. thoughts? On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney wrote: hi Julien, I'm very sorry about the inconvenience with this and the delay in getting it sorted out. 
I will triage this evening by disabling the Parquet tests in Arrow until we get the current problems under control. When we re-enable the Parquet tests in Travis CI I agree we should pin the version SHA. - Wes On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem wrote: The Arrow cpp travis-ci build is broken right now because it depends on parquet-cpp which has changed in an incompatible way. [1] [2] (or so it looks to me) Since parquet-cpp is not released yet it is totally fine to make incompatible API changes. However, we may want to pin the Arrow to Parquet dependency (on a git sha?) to prevent cross project changes from breaking the master build. Since I'm not one of the core cpp dev on those projects I mainly want to start that conversation rather than prescribe a solution. Feel free to take this as a straw man and suggest something else. [1] https://travis-ci.org/apache/arrow/jobs/156080555 [2] https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d 5af150dd31/ci/travis_before_script_cpp.sh -- Julien -- Julien
Re: Reviving Parquet sync ups
+1 for a sync up and for the European friendly time. Should be able to join this time. On 01.09.16 08:02, Julien Le Dem wrote: Hi Piyush, You are totally right. Sync ups are an important part of keeping the community informed and making progress. I'll schedule one for next week. Thursday 10 am PT? Julien On Aug 31, 2016, at 18:54, Piyush Narang wrote: hi folks, A few months back we used to have Parquet community sync ups via hangouts which were a nice opportunity to chat with other Parquet developers and discuss major / minor agenda items (e.g. 1.9.0 release / Parquet 2.0 etc) and things folks were working on. As it has been a while since the last sync up, I was wondering if there would be interest in reviving this? Thanks, -- - Piyush
Re: Parquet Vectorized Read hackathon
Yes, I'm GMT +1 On 05.07.16 18:52, Julien Le Dem wrote: If there are people interested in the cpp implementation we’ll talk about that too. I’m happy to give context or help with the encoding. In particular a Parquet -> Arrow vectorized converter would be great. Are you GMT +1 ? We can schedule a 1 hour slot in the morning for discussing with remote folks in Europe. (same in afternoon if there are people joining from Asia) Julien On Jul 5, 2016, at 2:37 AM, Uwe Korn <uw...@xhochy.com> wrote: Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ?
Re: Parquet Vectorized Read hackathon
7/12 and 7/14 are OK for me. I'm mainly interested in the Parquet-cpp->Arrow-C++->PyArrow path for now. Encodings other than plain encoding are currently on my near future roadmap. On 05.07.16 19:00, Julien Le Dem wrote: 7/14 works better for me. For now we have for 7/14: - OK for 7/14: Jacques, Ryan, Julien - Please confirm the date (and time): Deepak, Cheng, Uwe Please send a short description of the projects you’re working on and what your particular interest is. On Jul 5, 2016, at 9:50 AM, Ryan Blue <rb...@netflix.com.INVALID> wrote: I'm in, and both 7/12 and 7/14 work for me. rb On Tue, Jul 5, 2016 at 9:15 AM, Jacques Nadeau <jacq...@apache.org> wrote: Great idea, Julien! I vote for 7/12 or 7/14 On Tue, Jul 5, 2016 at 2:37 AM, Uwe Korn <uw...@xhochy.com> wrote: Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ? -- Ryan Blue Software Engineer Netflix
Re: Parquet Vectorized Read hackathon
Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ?
List of Additions to Parquet 2
Hello, I'm currently looking at the differences between Parquet 1 and Parquet 2 to implement these versions as a switch in parquet-cpp. The only list I could find is the rather undetailed changelog [1]. Is there maybe some better list or do I need to go through the referenced changeset entries myself to find the actual differences? (If the latter is the case, I'd also make a PR afterwards that augments the documentation with some "(since version 2.0)" markings.) But I'm hoping a bit that there is some blog post or so out there that could make my life easier. Thanks, Uwe [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
Re: Parquet sync up
I'm sorry I wasn't able to join today again (traveling). We could choose an early time Pacific time to make the meeting accessible to both Asia and Europe -- I would suggest 8 or 9 AM Pacific 8 or 9 am PT would work for me (CEST), 4pm PT is just not manageable. Also: Do we have a calendar where I can see in advance when sync ups are? Currently I'm working on the Parquet integration with Arrow and on building a Python interface for libarrow-parquet. Once we have a basic working version, I will look into implementing missing features in the writer and improving general read/write performance in parquet-cpp. Uwe http://timesched.pocoo.org/?date=2016-05-11=pacific-standard-time!,de:berlin,cn:shanghai,us:new-york-city:ny I did not have much time for writing Parquet C++ development the last 6 weeks, but plan to help Uwe complete the writer implementation and work toward a more complete Apache Arrow integration (this is in progress here: https://github.com/apache/arrow/tree/master/cpp/src/arrow/parquet) Other items of immediate interest - C++ API to the file metadata (read + write) - Conda packaging for built artifacts (to make parquet-cpp easier for Python programmers to install portably when the time comes). I got Thrift C++ into conda-forge this week so this should not be hard now https://github.com/conda-forge/thrift-cpp-feedstock - Expanding column scan benchmarks (thanks Uwe for kickstarting the benchmarking effort!) - Perf improvements for the RLE decoder Thanks Wes On Wed, May 11, 2016 at 4:04 PM, Julien Le Dem wrote: The actual hangout url is https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up On Wed, May 11, 2016 at 3:57 PM, Julien Le Dem wrote: starting in 5 mins: https://plus.google.com/hangouts/_/event/parquet_sync_up On Wed, May 11, 2016 at 1:53 PM, Julien Le Dem wrote: It is happening at 4pm PT on google hangout https://plus.google.com/hangouts/_/event/parquet_sync_up (we can do a different time next time, based on timezone preferences. 
Afternoon is better for Asia. Morning is better for Europe) -- Julien -- Julien -- Julien
Re: Parquet sync up
Hello,

since I am in Europe, this is a very inconvenient time for me, so I would rather write a longer mail instead of joining. As a bit of input, here is what I'm up to at the moment:

* Write support in a basic form for parquet-cpp (no compression, fixed encodings, excessive memory usage, ..) is nearly done. I hope to open the final PR for discussion next week.
* Remaining tasks until I make the PR:
  * a bit of code cleanup
  * Going through the API again to make it consistent
  * Metadata for RowGroups and ColumnChunks

Afterwards I would look into one of the following tasks w.r.t. parquet-cpp:

* WriterProperties to specify compression, encoding, .. on a global and per-column basis.
* Performance benchmarks for Write
* Integration of Parquet support in Apache Arrow to use it with Python
* Reduce the memory usage of the initial Writer implementation (for this we probably need to extend the encoders a bit)

If anyone else is also looking into this, I'm happy to collaborate ;)

Cheers
Uwe

On 21.04.16 00:51, Julien Le Dem wrote:
It is happening at 4pm PT on google hangout https://plus.google.com/hangouts/_/event/parquet_sync_up
C++: API Documentation Style/Tool
Hello, I would like to start adding API documentation comments to the parquet-cpp code I'm currently working on. By default, I would use doxygen and doxygen-style comments for the API. Are there any other suggestions/best practices you would prefer? Greetings Uwe
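For illustration only, a doxygen-style comment on a small function could look like the sketch below. The function `CountNonNull` is a hypothetical helper made up for this example and is not part of parquet-cpp; the choice of `///` comments with `\brief`, `\param` and `\return` is just one of the comment styles doxygen accepts.

```cpp
#include <cstdint>
#include <vector>

/// \brief Count the entries of a column that hold an actual value.
///
/// Hypothetical helper, used here only to show the comment layout.
/// An entry counts as non-null when its definition level equals the
/// maximum definition level of the column.
///
/// \param[in] def_levels definition levels, one per entry
/// \param[in] max_def_level maximum definition level of the column
/// \return the number of entries whose level equals max_def_level
int64_t CountNonNull(const std::vector<int16_t>& def_levels,
                     int16_t max_def_level) {
  int64_t count = 0;
  for (int16_t level : def_levels) {
    if (level == max_def_level) {
      ++count;
    }
  }
  return count;
}
```

Doxygen would pick up the brief line as the summary and render the parameter and return descriptions in the generated reference page.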
Retrieving the full/expanded name of a column in parquet-cpp
Hello,

While using parquet-cpp, I'm trying to figure out how to reliably determine the index of a named/nested column. In my example, I have a nested column "neighbours.array" but may later add more columns named "??.array". Until now I used "column->descr()->name()" inside a loop over all columns in a RowGroup to determine if the current column is the one I want to read. This works fine for "top-level" columns, but for neighbours.array it only returns "array", the name of the primitive node in the schema description.

To solve my problem:

1. Do we already have a reliable solution to determine which column index "neighbours.array" is?
2. We could add a fullname (or differently named) function to the column description.
3. We could have a map on Reader or RowGroup level that maps expanded name to index.

If there is no solution yet, I'd be happy to implement 2 or 3 (or an alternative approach).

My schema is as follows (generated via ParquetAvroWriter):

    required group com.xhochy.AdjacencyArray {
      required int32 id
      required int32 degree
      required group neighbours {
        repeated int32 array
      }
    }

Greetings,
Uwe
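Option 3 could be sketched roughly as follows. This is a minimal, self-contained illustration and not the parquet-cpp API: `Node`, `CollectLeaves`, `BuildColumnIndex` and `ExampleSchema` are hypothetical names, and the schema tree is modelled with a plain struct instead of the library's schema classes. It assumes that column indices follow the depth-first order of the primitive (leaf) nodes and that the name of the root group is not part of the expanded name.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a schema node (not a parquet-cpp type).
struct Node {
  std::string name;
  std::vector<Node> children;  // empty for a primitive (leaf) node
};

// Walk the tree depth-first and record every leaf under its dotted path.
// Assumption: a node without children is a primitive column.
void CollectLeaves(const Node& node, const std::string& prefix,
                   std::map<std::string, int>* index_by_path,
                   int* next_index) {
  const std::string path =
      prefix.empty() ? node.name : prefix + "." + node.name;
  if (node.children.empty()) {
    (*index_by_path)[path] = (*next_index)++;
    return;
  }
  for (const Node& child : node.children) {
    CollectLeaves(child, path, index_by_path, next_index);
  }
}

// Build the map from expanded name to column index for a whole schema.
// The root group's own name is left out of the paths.
std::map<std::string, int> BuildColumnIndex(const Node& root) {
  std::map<std::string, int> index_by_path;
  int next_index = 0;
  for (const Node& child : root.children) {
    CollectLeaves(child, "", &index_by_path, &next_index);
  }
  return index_by_path;
}

// The schema from the mail above, expressed with the hypothetical Node type.
Node ExampleSchema() {
  return Node{"com.xhochy.AdjacencyArray",
              {Node{"id", {}},
               Node{"degree", {}},
               Node{"neighbours", {Node{"array", {}}}}}};
}
```

With such a map, `BuildColumnIndex(ExampleSchema()).at("neighbours.array")` would give the index of the nested column directly, and further "??.array" columns would not collide because each is keyed by its full path.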