[jira] [Created] (PARQUET-479) Add regression tests to the build process
Aliaksei Sandryhaila created PARQUET-479: Summary: Add regression tests to the build process Key: PARQUET-479 URL: https://issues.apache.org/jira/browse/PARQUET-479 Project: Parquet Issue Type: Improvement Components: parquet-cpp Affects Versions: cpp-0.1 Reporter: Aliaksei Sandryhaila Assignee: Aliaksei Sandryhaila Fix For: cpp-0.1 We need to add a testing framework for unit tests, and run it as a part of each Travis CI build. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-479) Add regression tests to the build process
[ https://issues.apache.org/jira/browse/PARQUET-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123570#comment-15123570 ] Aliaksei Sandryhaila commented on PARQUET-479: -- In our case, regression testing will consist of running all functional unit tests on each modification. This will ensure that we do not mess up the already implemented, presumably correct functionality.
[jira] [Commented] (PARQUET-479) Add regression tests to the build process
[ https://issues.apache.org/jira/browse/PARQUET-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123583#comment-15123583 ] Aliaksei Sandryhaila commented on PARQUET-479: -- Ah, I missed that you've already added it in .travis.yml a few days ago.
[jira] [Created] (PARQUET-481) Refactor and expand reader-test
Aliaksei Sandryhaila created PARQUET-481: Summary: Refactor and expand reader-test Key: PARQUET-481 URL: https://issues.apache.org/jira/browse/PARQUET-481 Project: Parquet Issue Type: Sub-task Components: parquet-cpp Affects Versions: cpp-0.1 Reporter: Aliaksei Sandryhaila Assignee: Aliaksei Sandryhaila Fix For: cpp-0.1

reader-test currently tests with a single Parquet file and only verifies that we can read it, not that the output is correct. Proposed changes:
- Move reader-test.cc to a separate directory parquet-cpp/tests (in the future, all unit tests will be located there)
- Expand it to work with multiple files
- Add a method ParquetFileReader::JsonPrint() that prints a file's contents in JSON format, so we can consistently compare the output with the ground truth stored in parquet-cpp/data. This method will also be handier than DebugPrint when we start working with nested columns.
[jira] [Updated] (PARQUET-472) Clean up InputStream ownership semantics in ColumnReader
[ https://issues.apache.org/jira/browse/PARQUET-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Le Dem updated PARQUET-472: -- Fix Version/s: (was: format-2.4.0) > Clean up InputStream ownership semantics in ColumnReader > > > Key: PARQUET-472 > URL: https://issues.apache.org/jira/browse/PARQUET-472 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Assignee: Aliaksei Sandryhaila > > Follow-up to PARQUET-418, PARQUET-433. The {{ColumnReader}} destructor uses > {{delete}} on an {{InputStream*}}. The lifetime of this object should be > managed by a {{std::unique_ptr}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
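A minimal sketch of the ownership change this issue asks for. The `InputStream`, `CountingStream`, and `ColumnReader` classes below are simplified, hypothetical stand-ins and not the actual parquet-cpp classes; the only point is that a `std::unique_ptr` member removes the need for a hand-written destructor that calls `delete`.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical stand-in for the real InputStream interface.
class InputStream {
 public:
  virtual ~InputStream() = default;
  virtual int64_t BytesRemaining() const = 0;
};

// Toy stream that counts its own destruction so the ownership
// transfer can be observed from the outside.
class CountingStream : public InputStream {
 public:
  explicit CountingStream(int* destroyed) : destroyed_(destroyed) {}
  ~CountingStream() override { ++*destroyed_; }
  int64_t BytesRemaining() const override { return 0; }

 private:
  int* destroyed_;
};

// Hypothetical reader: it owns the stream through std::unique_ptr,
// so the implicit destructor releases it exactly once.
class ColumnReader {
 public:
  explicit ColumnReader(std::unique_ptr<InputStream> stream)
      : stream_(std::move(stream)) {}
  bool HasStream() const { return stream_ != nullptr; }

 private:
  std::unique_ptr<InputStream> stream_;
};

// Returns how many streams were destroyed once a reader left scope.
int DestroyedAfterScope() {
  int destroyed = 0;
  {
    ColumnReader reader(
        std::unique_ptr<InputStream>(new CountingStream(&destroyed)));
    assert(reader.HasStream());
  }
  return destroyed;
}
```

Taking the `std::unique_ptr` by value in the constructor makes the transfer of ownership explicit at the call site, which is one common way to express "the reader now owns this stream".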
[jira] [Created] (PARQUET-482) Organize src code file structure to have a very clear folder with public headers.
Nong Li created PARQUET-482: --- Summary: Organize src code file structure to have a very clear folder with public headers. Key: PARQUET-482 URL: https://issues.apache.org/jira/browse/PARQUET-482 Project: Parquet Issue Type: Improvement Reporter: Nong Li We should organize the source code structure to have a folder that contains all the public headers and nothing else. This makes it easy to understand what the public API is and which APIs need to be reviewed with respect to compatibility.
[jira] [Updated] (PARQUET-482) Organize src code file structure to have a very clear folder with public headers.
[ https://issues.apache.org/jira/browse/PARQUET-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nong Li updated PARQUET-482: Component/s: parquet-cpp
Organizing functional components and a bottom-up testing plan for parquet-cpp
hi folks,

Since there are so many moving pieces in creating a full-featured Parquet reader-writer, I propose we start working out a plan to create test fixtures and tools that enable us to develop faster.

Specifically, we need to achieve maximum decoupling between functional components. Every unit of functionality should be testable without having to create actual valid Parquet test data files. Smoke tests on real data will help, but they are a band-aid solution compared with approaching the problem from a rigorous test-driven perspective.

To assist with the discussion, let's address the different parts of the testing process:

- Functional unit testing of decoupled components. We need to make a diagram of all those boxes and their interfaces with each other. For example: a column decoder only needs to know how to ask for its next data page, but not where the data page is located physically.

- Integration / macro-level testing, i.e. the "everything works together" part of the problem.

I don't think investing in much top-down / integration testing of the library will help us (and it may actively hurt us) until we organize the functional components of the library so that everything can be tested easily in isolation.

I propose that we use a Google document to help with this design process, and we can learn from parquet-mr and other implementations of Parquet to help move things along. In doing this we can cross-reference existing and new JIRAs so that it's clear exactly what needs to be done for each part of the system.

Let me know your thoughts.

thanks,
Wes
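To make the column decoder example concrete, here is a rough sketch of what such a decoupled interface could look like. All names (`DataPage`, `PageSource`, `InMemoryPageSource`, `ByteCountingDecoder`) are hypothetical and chosen for illustration only; the toy "decoder" just counts bytes, where a real one would decode values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical page representation: just the raw bytes.
struct DataPage {
  std::vector<uint8_t> bytes;
};

// The only thing a decoder needs: a way to ask for the next page.
// Where the page physically lives is hidden behind this interface.
class PageSource {
 public:
  virtual ~PageSource() = default;
  // Returns the next page, or nullptr when there are no more pages.
  virtual std::unique_ptr<DataPage> NextPage() = 0;
};

// Test fixture: serves pages from memory, so no Parquet file is needed.
class InMemoryPageSource : public PageSource {
 public:
  explicit InMemoryPageSource(std::vector<std::vector<uint8_t>> pages)
      : pages_(std::move(pages)) {}

  std::unique_ptr<DataPage> NextPage() override {
    if (next_ >= pages_.size()) return nullptr;
    std::unique_ptr<DataPage> page(new DataPage());
    page->bytes = pages_[next_++];
    return page;
  }

 private:
  std::vector<std::vector<uint8_t>> pages_;
  size_t next_ = 0;
};

// Toy consumer standing in for a column decoder: it only talks to
// PageSource and sums the sizes of all pages it is handed.
class ByteCountingDecoder {
 public:
  explicit ByteCountingDecoder(PageSource* source) : source_(source) {}

  size_t CountAllBytes() {
    size_t total = 0;
    while (std::unique_ptr<DataPage> page = source_->NextPage()) {
      total += page->bytes.size();
    }
    return total;
  }

 private:
  PageSource* source_;
};

// Exercises the decoder against two in-memory pages (3 + 2 bytes).
size_t CountBytesInTestPages() {
  std::vector<std::vector<uint8_t>> pages = {{1, 2, 3}, {4, 5}};
  InMemoryPageSource source(pages);
  ByteCountingDecoder decoder(&source);
  return decoder.CountAllBytes();
}
```

With this shape, a file-backed source and an in-memory source are interchangeable from the decoder's point of view, which is what lets the decoder be unit tested without valid Parquet files.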
[jira] [Updated] (PARQUET-472) Clean up InputStream ownership semantics in ColumnReader
[ https://issues.apache.org/jira/browse/PARQUET-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Le Dem updated PARQUET-472: -- Fix Version/s: cpp-0.1
[jira] [Commented] (PARQUET-481) Refactor and expand reader-test
[ https://issues.apache.org/jira/browse/PARQUET-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123827#comment-15123827 ] Wes McKinney commented on PARQUET-481: -- I feel very strongly about keeping the low-level unit tests next to the code they are testing (see, for example, the Kudu and Impala codebases), so {{foo.cc}} is accompanied by {{foo-test.cc}}. If we are doing some macro-level testing that spans the library, let's create a separate directory and put those tests there. It's somewhat disorganized right now: {{reader-test.cc}} contains tests that belong in {{column/reader-test.cc}} and {{column/scanner-test.cc}}. Let's create a common header file for unit tests containing test fixtures that can be shared among unit test suites.
[jira] [Updated] (PARQUET-483) Write tests investigating failure modes with malformed encoded levels in data pages
[ https://issues.apache.org/jira/browse/PARQUET-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated PARQUET-483: - Summary: Write tests investigating failure modes with malformed encoded levels in data pages (was: Write tests investigate failure modes with malformed encoded levels in data pages) > Write tests investigating failure modes with malformed encoded levels in data > pages > --- > > Key: PARQUET-483 > URL: https://issues.apache.org/jira/browse/PARQUET-483 > Project: Parquet > Issue Type: Test > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Minor > > Follow-up to PARQUET-435. If we are not able to decode as many levels as we > expect, this should be caught and raised with a helpful error message. There > are some other hypothetical cases we should check for and verify in tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
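One of the checks described here, sketched under assumptions: after decoding, compare the number of levels obtained with the number the page header promised, and raise an error that says what went wrong. The exception type and the function names are hypothetical stand-ins, not the parquet-cpp API.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception type used only in this sketch.
class ParquetException : public std::runtime_error {
 public:
  explicit ParquetException(const std::string& msg)
      : std::runtime_error(msg) {}
};

// Throws when fewer (or more) levels were decoded than expected.
// `kind` is "definition" or "repetition" and only feeds the message.
void CheckLevelCount(const std::vector<int16_t>& decoded, int64_t expected,
                     const std::string& kind) {
  if (static_cast<int64_t>(decoded.size()) != expected) {
    std::ostringstream msg;
    msg << "Expected " << expected << " " << kind
        << " levels but decoded " << decoded.size();
    throw ParquetException(msg.str());
  }
}

// Test helper: returns the error message, or an empty string on success.
std::string LevelCountError(const std::vector<int16_t>& decoded,
                            int64_t expected) {
  try {
    CheckLevelCount(decoded, expected, "definition");
  } catch (const ParquetException& e) {
    return e.what();
  }
  return "";
}
```

A unit test for the malformed case can then feed a deliberately truncated level buffer through the decoder and assert on the message, without needing a corrupt file on disk.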
[jira] [Commented] (PARQUET-438) Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests
[ https://issues.apache.org/jira/browse/PARQUET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124538#comment-15124538 ] Deepak Majeti commented on PARQUET-438: --- [~wesmckinn] I misunderstood the parquet-mr rle-bit-packed-hybrid code with respect to the parquet spec. The impala code makes sense now. > Update RLE encoder/decoder modules from Impala upstream changes and adapt > unit tests > > > Key: PARQUET-438 > URL: https://issues.apache.org/jira/browse/PARQUET-438 > Project: Parquet > Issue Type: New Feature > Components: parquet-cpp >Reporter: Wes McKinney > > Depends on PARQUET-437 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-483) Write tests investigate failure modes with malformed encoded levels in data pages
Wes McKinney created PARQUET-483: Summary: Write tests investigate failure modes with malformed encoded levels in data pages Key: PARQUET-483 URL: https://issues.apache.org/jira/browse/PARQUET-483 Project: Parquet Issue Type: Test Components: parquet-cpp Reporter: Wes McKinney Priority: Minor Follow-up to PARQUET-435. If we are not able to decode as many levels as we expect, this should be caught and raised with a helpful error message. There are some other hypothetical cases we should check for and verify in tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-450) Small typos/issues in parquet-format documentation
[ https://issues.apache.org/jira/browse/PARQUET-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Le Dem resolved PARQUET-450. --- Resolution: Fixed Fix Version/s: format-2.4.0 Issue resolved by pull request 36 [https://github.com/apache/parquet-format/pull/36] > Small typos/issues in parquet-format documentation > -- > > Key: PARQUET-450 > URL: https://issues.apache.org/jira/browse/PARQUET-450 > Project: Parquet > Issue Type: Task > Components: parquet-format >Reporter: Laurent Goujon >Assignee: Laurent Goujon >Priority: Minor > Fix For: format-2.4.0 > > > I noticed several typos/omissions in parquet format documentation: > - HDFS should be all uppercase (acronym) > - enncoding instead of encoding > - markdown issues > - no link to the thrift definition file > - the integer format (LE vs BE) is not specified for the file metadata > - the order of informations in a data page -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Organizing functional components and a bottom-up testing plan for parquet-cpp
Sounds good to me. At some point (later) we'll have to do some cross-compatibility testing with parquet-mr as well to make sure everything is on the same page. CC'ing some folks who should probably chime in.

On Fri, Jan 29, 2016 at 10:21 AM, Wes McKinney wrote:
> hi folks,
>
> Since there's so many moving pieces with creating a full-featured Parquet
> reader-writer, I propose we start planning out a plan to create test
> fixtures and tools to enable us to develop faster.
>
> Specifically, we need to achieve maximum decoupling between functional
> components. Every unit of functionality should be testable without having
> to create actual valid Parquet test data files. Smoke tests on real data
> will help, but it's a band-aid solution vs approaching the problem from a
> rigorous test-driven perspective.
>
> To assist with the discussion, let's address the different parts of the
> testing process
>
> - Functional unit testing of decoupled components. We need to make a
> diagram of all those boxes and what is their interface with each other. For
> example: a column decoder only needs to know how to ask for its next data
> page, but not where the data page is located physically.
>
> - Integration / macro-level testing, i.e. the "everything works together"
> part of the problem.
>
> I don't think investing in much top-down / integration testing of the
> library will help us (and may actually actively hurt us) until we organize
> the functional components of the library in a way that everything can be
> tested easily in isolation.
>
> I propose that we use a Google document to help with this design process
> and we can learn from parquet-mr and other implementations of Parquet to
> help move things along. In doing this we can cross-reference existing and
> new JIRAs so that it's clear exactly what needs to be done for each part of
> the system.
>
> Let me know your thoughts.
>
> thanks,
> Wes
>

-- Julien
[jira] [Updated] (PARQUET-438) Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests
[ https://issues.apache.org/jira/browse/PARQUET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated PARQUET-438: - Summary: Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests (was: Adapt any relevant encoding and compression unit tests from Impala) > Update RLE encoder/decoder modules from Impala upstream changes and adapt > unit tests > > > Key: PARQUET-438 > URL: https://issues.apache.org/jira/browse/PARQUET-438 > Project: Parquet > Issue Type: New Feature > Components: parquet-cpp >Reporter: Wes McKinney > > Depends on PARQUET-437 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-438) Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests
[ https://issues.apache.org/jira/browse/PARQUET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124123#comment-15124123 ] Wes McKinney commented on PARQUET-438: -- See https://github.com/apache/parquet-cpp/pull/31 for the patch. I'm addressing only the RLE encoding bits here. I looked into Impala's dictionary encoding facilities, and it did not appear trivial to adapt that code here. I will investigate in follow-up JIRAs.
[jira] [Updated] (PARQUET-462) Implement a LevelDecoder class (like Impala) which dispatches to RLE or BIT_PACKED decoding as appropriate
[ https://issues.apache.org/jira/browse/PARQUET-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated PARQUET-462: - Summary: Implement a LevelDecoder class (like Impala) which dispatches to RLE or BIT_PACKED decoding as appropriate (was: Create a new Level class for definition and repetition values) > Implement a LevelDecoder class (like Impala) which dispatches to RLE or > BIT_PACKED decoding as appropriate > -- > > Key: PARQUET-462 > URL: https://issues.apache.org/jira/browse/PARQUET-462 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Deepak Majeti > > This class extends the RleDecoder class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
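A rough sketch of the dispatch idea in the new summary. It is an illustration under assumptions: the issue says the class extends `RleDecoder`, whereas this sketch uses composition with two pluggable routines so it can stay self-contained. The `Encoding` enum, `DecodeFn`, and `DispatchMarker` are hypothetical names, and the two decoding routines are stubs that return a marker value and do no real bit-level decoding.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Subset of encodings relevant to this sketch.
enum class Encoding { PLAIN, RLE, BIT_PACKED };

// Common signature for a level-decoding routine: read `count` levels
// from `data` and return them.
using DecodeFn = std::function<std::vector<int16_t>(
    const std::vector<uint8_t>& data, int count)>;

// Hypothetical dispatcher: picks the routine matching the encoding
// declared for the page's levels, and rejects anything else.
class LevelDecoder {
 public:
  LevelDecoder(DecodeFn rle, DecodeFn bit_packed)
      : rle_(std::move(rle)), bit_packed_(std::move(bit_packed)) {}

  std::vector<int16_t> Decode(Encoding encoding,
                              const std::vector<uint8_t>& data,
                              int count) const {
    switch (encoding) {
      case Encoding::RLE:
        return rle_(data, count);
      case Encoding::BIT_PACKED:
        return bit_packed_(data, count);
      default:
        throw std::invalid_argument("encoding is not valid for levels");
    }
  }

 private:
  DecodeFn rle_;
  DecodeFn bit_packed_;
};

// Demo helper: the stub routines return 1 (RLE) or 2 (BIT_PACKED),
// which makes the branch taken by the dispatcher visible.
int16_t DispatchMarker(Encoding encoding) {
  LevelDecoder decoder(
      [](const std::vector<uint8_t>&, int) { return std::vector<int16_t>{1}; },
      [](const std::vector<uint8_t>&, int) { return std::vector<int16_t>{2}; });
  return decoder.Decode(encoding, {}, 1).at(0);
}
```

Keeping the dispatch separate from the decoding routines means each routine can be tested on its own, and the dispatcher can be tested with stubs as shown here.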
[jira] [Resolved] (PARQUET-432) Complete a todo for method ColumnDescriptor.compareTo()
[ https://issues.apache.org/jira/browse/PARQUET-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved PARQUET-432. Resolution: Fixed Issue resolved by pull request 314 [https://github.com/apache/parquet-mr/pull/314] > Complete a todo for method ColumnDescriptor.compareTo() > --- > > Key: PARQUET-432 > URL: https://issues.apache.org/jira/browse/PARQUET-432 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.8.0, 1.8.1 >Reporter: Liwei Lin >Assignee: Liwei Lin >Priority: Minor > Fix For: 1.9.0 > > > The ticket proposes to handle the case *path.length < o.path.length* in the method ColumnDescriptor.compareTo().
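The missing case can be illustrated with a small C++ analog. The real method is Java code in parquet-mr and is not reproduced here; `ComparePaths` is a hypothetical name, and the rule that a path which is a strict prefix of another sorts first is an assumption of this sketch, not something the ticket text confirms.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Compares two column paths element by element. Returns -1, 0, or 1.
// When one path is a strict prefix of the other, the shorter path is
// ordered first (the *path.length < o.path.length* case).
int ComparePaths(const std::vector<std::string>& a,
                 const std::vector<std::string>& b) {
  const std::size_t common = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < common; ++i) {
    const int c = a[i].compare(b[i]);
    if (c != 0) return c < 0 ? -1 : 1;
  }
  if (a.size() < b.size()) return -1;
  if (a.size() > b.size()) return 1;
  return 0;
}
```

Without the two length checks at the end, the paths `a` and `a.b` would compare as equal, which breaks the consistency a comparison function needs for sorting.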
[jira] [Commented] (PARQUET-467) Check for and raise error for deprecated BIT_PACKED encoding
[ https://issues.apache.org/jira/browse/PARQUET-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124182#comment-15124182 ] Wes McKinney commented on PARQUET-467: -- Per PARQUET-462 we can go ahead and implement this level encoding. Will leave this issue open until it's been tested sufficiently. > Check for and raise error for deprecated BIT_PACKED encoding > > > Key: PARQUET-467 > URL: https://issues.apache.org/jira/browse/PARQUET-467 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Minor > > This is implemented in Impala, but it is unclear how much data in the wild is encoded in this format (deprecated according to parquet-format), with RLE as the preferred encoding (for repetition/definition levels). At minimum we should raise an exception if this encoding is encountered.
[jira] [Commented] (PARQUET-438) Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests
[ https://issues.apache.org/jira/browse/PARQUET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124173#comment-15124173 ] Wes McKinney commented on PARQUET-438: -- [~mdeepak] If you identify a specific problem with the Impala RLE encoding/decoding code, I'm sure the team will be very keen to hear about it. > Update RLE encoder/decoder modules from Impala upstream changes and adapt > unit tests > > > Key: PARQUET-438 > URL: https://issues.apache.org/jira/browse/PARQUET-438 > Project: Parquet > Issue Type: New Feature > Components: parquet-cpp >Reporter: Wes McKinney > > Depends on PARQUET-437 -- This message was sent by Atlassian JIRA (v6.3.4#6332)