[jira] [Created] (PARQUET-479) Add regression tests to the build process

2016-01-29 Thread Aliaksei Sandryhaila (JIRA)
Aliaksei Sandryhaila created PARQUET-479:


             Summary: Add regression tests to the build process
                 Key: PARQUET-479
                 URL: https://issues.apache.org/jira/browse/PARQUET-479
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
    Affects Versions: cpp-0.1
            Reporter: Aliaksei Sandryhaila
            Assignee: Aliaksei Sandryhaila
             Fix For: cpp-0.1


We need to add a testing framework for unit tests, and run it as a part of each 
Travis CI build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-479) Add regression tests to the build process

2016-01-29 Thread Aliaksei Sandryhaila (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123570#comment-15123570
 ] 

Aliaksei Sandryhaila commented on PARQUET-479:
--

In our case, regression testing will consist of running all functional unit 
tests on each modification. This will ensure that a change does not break 
functionality that is already implemented and presumably correct.

> Add regression tests to the build process
> -
>
> Key: PARQUET-479
> URL: https://issues.apache.org/jira/browse/PARQUET-479
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Affects Versions: cpp-0.1
>Reporter: Aliaksei Sandryhaila
>Assignee: Aliaksei Sandryhaila
> Fix For: cpp-0.1
>
>
> We need to add a testing framework for unit tests, and run it as a part of 
> each Travis CI build.





[jira] [Commented] (PARQUET-479) Add regression tests to the build process

2016-01-29 Thread Aliaksei Sandryhaila (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123583#comment-15123583
 ] 

Aliaksei Sandryhaila commented on PARQUET-479:
--

Ah, I missed that you've already added it in .travis.yml a few days ago.

> Add regression tests to the build process
> -
>
> Key: PARQUET-479
> URL: https://issues.apache.org/jira/browse/PARQUET-479
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Affects Versions: cpp-0.1
>Reporter: Aliaksei Sandryhaila
>Assignee: Aliaksei Sandryhaila
> Fix For: cpp-0.1
>
>
> We need to add a testing framework for unit tests, and run it as a part of 
> each Travis CI build.





[jira] [Created] (PARQUET-481) Refactor and expand reader-test

2016-01-29 Thread Aliaksei Sandryhaila (JIRA)
Aliaksei Sandryhaila created PARQUET-481:


             Summary: Refactor and expand reader-test
                 Key: PARQUET-481
                 URL: https://issues.apache.org/jira/browse/PARQUET-481
             Project: Parquet
          Issue Type: Sub-task
          Components: parquet-cpp
    Affects Versions: cpp-0.1
            Reporter: Aliaksei Sandryhaila
            Assignee: Aliaksei Sandryhaila
             Fix For: cpp-0.1


reader-test currently tests with a single parquet file and only verifies that 
we can read it, not the correctness of the output.

Proposed changes:
- Move reader-test.cc to a separate directory parquet-cpp/tests (in the future, 
all unit tests will be located there)
- Expand it to work with multiple files
- Add a method ParquetFileReader::JsonPrint() that prints a file's contents in 
JSON format, so we can consistently compare the output with the ground truth 
stored in parquet-cpp/data. This method will also be handier than DebugPrint 
when we start working with nested columns.
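
The comparison against ground truth could look roughly like the sketch below. 
It assumes that JsonPrint() (which does not exist yet) can write to a 
std::ostream; the JsonPrinter alias and MatchesGroundTruth helper are 
hypothetical names used only for illustration, and the expected text would in 
practice be read from a file under parquet-cpp/data.

```cpp
#include <cassert>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical signature: any callable that writes a file's JSON rendering to
// a stream, standing in for the proposed ParquetFileReader::JsonPrint().
using JsonPrinter = std::function<void(std::ostream&)>;

// Captures the printer's output and compares it byte-for-byte with the
// expected text (in a real test, the contents of a ground-truth file).
bool MatchesGroundTruth(const JsonPrinter& print, const std::string& expected) {
  assert(print && "printer must be callable");
  std::ostringstream actual;
  print(actual);
  return actual.str() == expected;
}
```

A byte-for-byte comparison only works if the JSON output is deterministic 
(stable key order and number formatting), which is one reason to prefer a 
dedicated printing method over ad hoc debug output.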





[jira] [Updated] (PARQUET-472) Clean up InputStream ownership semantics in ColumnReader

2016-01-29 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-472:
--
Fix Version/s: (was: format-2.4.0)

> Clean up InputStream ownership semantics in ColumnReader
> 
>
> Key: PARQUET-472
> URL: https://issues.apache.org/jira/browse/PARQUET-472
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Aliaksei Sandryhaila
>
> Follow-up to PARQUET-418, PARQUET-433. The {{ColumnReader}} destructor uses 
> {{delete}} on an {{InputStream*}}. The lifetime of this object should be 
> managed by a {{std::unique_ptr}}.
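
A minimal sketch of the ownership change described above. The InputStream and 
ColumnReader classes here are simplified, hypothetical stand-ins and not the 
actual parquet-cpp declarations; the point is only that the reader needs no 
user-written destructor once it holds the stream in a std::unique_ptr.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical stand-in for the real InputStream interface.
class InputStream {
 public:
  virtual ~InputStream() = default;
  virtual std::int64_t BytesRemaining() const = 0;
};

// Simplified reader: taking the stream by std::unique_ptr makes the transfer
// of ownership explicit at the call site, and the stream is destroyed
// automatically when the reader goes away.
class ColumnReader {
 public:
  explicit ColumnReader(std::unique_ptr<InputStream> stream)
      : stream_(std::move(stream)) {
    assert(stream_ != nullptr);
  }

  std::int64_t BytesRemaining() const { return stream_->BytesRemaining(); }

 private:
  std::unique_ptr<InputStream> stream_;  // sole owner; no manual delete
};
```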





[jira] [Created] (PARQUET-482) Organize src code file structure to have a very clear folder with public headers.

2016-01-29 Thread Nong Li (JIRA)
Nong Li created PARQUET-482:
---

             Summary: Organize src code file structure to have a very clear
                      folder with public headers.
                 Key: PARQUET-482
                 URL: https://issues.apache.org/jira/browse/PARQUET-482
             Project: Parquet
          Issue Type: Improvement
            Reporter: Nong Li


We should organize the source code structure to have a folder where all the 
public headers are and nothing else. This makes it easy to understand what 
the public API is and which APIs need to be reviewed with respect to 
compatibility.





[jira] [Updated] (PARQUET-482) Organize src code file structure to have a very clear folder with public headers.

2016-01-29 Thread Nong Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nong Li updated PARQUET-482:

Component/s: parquet-cpp

> Organize src code file structure to have a very clear folder with public 
> headers.
> -
>
> Key: PARQUET-482
> URL: https://issues.apache.org/jira/browse/PARQUET-482
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Nong Li
>
> We should organize the source code structure to have a folder where all the 
> public headers are and nothing else. This makes it easy to understand what 
> the public API is and which APIs need to be reviewed with respect to 
> compatibility.





Organizing functional components and a bottom-up testing plan for parquet-cpp

2016-01-29 Thread Wes McKinney
hi folks,

Since there are so many moving pieces in creating a full-featured Parquet
reader-writer, I propose we start planning how to create test fixtures and
tools that enable us to develop faster.

Specifically, we need to achieve maximum decoupling between functional
components. Every unit of functionality should be testable without having
to create actual valid Parquet test data files. Smoke tests on real data
will help, but it's a band-aid solution vs approaching the problem from a
rigorous test-driven perspective.

To assist with the discussion, let's address the different parts of the
testing process:

- Functional unit testing of decoupled components. We need to make a
diagram of all those boxes and how they interface with each other. For
example: a column decoder only needs to know how to ask for its next data
page, but not where the data page is located physically.

- Integration / macro-level testing, i.e. the "everything works together"
part of the problem.

I don't think investing in much top-down / integration testing of the
library will help us (and may actually actively hurt us) until we organize
the functional components of the library in a way that everything can be
tested easily in isolation.
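
To make the column decoder example concrete, here is a rough sketch of what
such a decoupled interface could look like. All names (DataPage, PageSource,
InMemoryPageSource, ColumnDecoder) are hypothetical and do not refer to
existing parquet-cpp classes. The decoder only knows how to ask for the next
page, so a unit test can feed it pages from memory without a Parquet file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, already-decoded page; real pages carry encoded bytes.
struct DataPage {
  std::vector<std::int32_t> values;
};

// The only thing the decoder knows: how to ask for the next page.
class PageSource {
 public:
  virtual ~PageSource() = default;
  // Returns nullptr when there are no more pages.
  virtual const DataPage* NextPage() = 0;
};

// Test double that serves pages from memory; no file I/O involved.
class InMemoryPageSource : public PageSource {
 public:
  explicit InMemoryPageSource(std::vector<DataPage> pages)
      : pages_(std::move(pages)) {}

  const DataPage* NextPage() override {
    return next_ < pages_.size() ? &pages_[next_++] : nullptr;
  }

 private:
  std::vector<DataPage> pages_;
  std::size_t next_ = 0;
};

class ColumnDecoder {
 public:
  explicit ColumnDecoder(std::unique_ptr<PageSource> source)
      : source_(std::move(source)) {
    assert(source_ != nullptr);
  }

  // Drains every page and returns the concatenated values.
  std::vector<std::int32_t> ReadAll() {
    std::vector<std::int32_t> out;
    while (const DataPage* page = source_->NextPage()) {
      out.insert(out.end(), page->values.begin(), page->values.end());
    }
    return out;
  }

 private:
  std::unique_ptr<PageSource> source_;
};
```

A file-backed PageSource would implement the same interface, so the decoder
tests would not need to change when the physical layout code does.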

I propose that we use a Google document to help with this design process
and we can learn from parquet-mr and other implementations of Parquet to
help move things along. In doing this we can cross-reference existing and
new JIRAs so that it's clear exactly what needs to be done for each part of
the system.

Let me know your thoughts.

thanks,
Wes


[jira] [Updated] (PARQUET-472) Clean up InputStream ownership semantics in ColumnReader

2016-01-29 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-472:
--
Fix Version/s: cpp-0.1

> Clean up InputStream ownership semantics in ColumnReader
> 
>
> Key: PARQUET-472
> URL: https://issues.apache.org/jira/browse/PARQUET-472
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Aliaksei Sandryhaila
> Fix For: cpp-0.1
>
>
> Follow-up to PARQUET-418, PARQUET-433. The {{ColumnReader}} destructor uses 
> {{delete}} on an {{InputStream*}}. The lifetime of this object should be 
> managed by a {{std::unique_ptr}}.





[jira] [Commented] (PARQUET-481) Refactor and expand reader-test

2016-01-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123827#comment-15123827
 ] 

Wes McKinney commented on PARQUET-481:
--

I feel very strongly about keeping the low-level unit tests next to the code 
they are testing (see for example the Kudu and Impala codebases) -- so 
{{foo.cc}} is accompanied by {{foo-test.cc}}. If we are doing some macro-level 
testing that spans the library, let's create a separate directory and put those 
tests there. It's sort of disorganized right now -- {{reader-test.cc}} contains 
tests that belong in a {{column/reader-test.cc}} and 
{{column/scanner-test.cc}}. Let's create a common header file for unit tests 
containing test fixtures that can be shared amongst unit test suites.

> Refactor and expand reader-test
> ---
>
> Key: PARQUET-481
> URL: https://issues.apache.org/jira/browse/PARQUET-481
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp
>Affects Versions: cpp-0.1
>Reporter: Aliaksei Sandryhaila
>Assignee: Aliaksei Sandryhaila
> Fix For: cpp-0.1
>
>
> reader-test currently tests with a single parquet file and only verifies that 
> we can read it, not the correctness of the output.
> Proposed changes:
> - Move reader-test.cc to a separate directory parquet-cpp/tests (in the 
> future, all unit tests will be located there)
> - Expand it to work with multiple files
> - Add a method ParquetFileReader::JsonPrint() that prints a file's contents in 
> JSON format, so we can consistently compare the output with the ground truth 
> stored in parquet-cpp/data. This method will also be handier than 
> DebugPrint when we start working with nested columns.





[jira] [Updated] (PARQUET-483) Write tests investigating failure modes with malformed encoded levels in data pages

2016-01-29 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-483:
-
Summary: Write tests investigating failure modes with malformed encoded 
levels in data pages  (was: Write tests investigate failure modes with 
malformed encoded levels in data pages)

> Write tests investigating failure modes with malformed encoded levels in data 
> pages
> ---
>
> Key: PARQUET-483
> URL: https://issues.apache.org/jira/browse/PARQUET-483
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
>
> Follow-up to PARQUET-435. If we are not able to decode as many levels as we 
> expect, this should be caught and raised with a helpful error message. There 
> are some other hypothetical cases we should check for and verify in tests. 





[jira] [Commented] (PARQUET-438) Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests

2016-01-29 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124538#comment-15124538
 ] 

Deepak Majeti commented on PARQUET-438:
---

[~wesmckinn] I misunderstood the parquet-mr rle-bit-packed-hybrid code with 
respect to the Parquet spec. The Impala code makes sense now. 

> Update RLE encoder/decoder modules from Impala upstream changes and adapt 
> unit tests
> 
>
> Key: PARQUET-438
> URL: https://issues.apache.org/jira/browse/PARQUET-438
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> Depends on PARQUET-437





[jira] [Created] (PARQUET-483) Write tests investigate failure modes with malformed encoded levels in data pages

2016-01-29 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-483:


             Summary: Write tests investigate failure modes with malformed
                      encoded levels in data pages
                 Key: PARQUET-483
                 URL: https://issues.apache.org/jira/browse/PARQUET-483
             Project: Parquet
          Issue Type: Test
          Components: parquet-cpp
            Reporter: Wes McKinney
            Priority: Minor


Follow-up to PARQUET-435. If we are not able to decode as many levels as we 
expect, this should be caught and raised with a helpful error message. There 
are some other hypothetical cases we should check for and verify in tests. 
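
As a rough illustration of the first case, a check along these lines could 
run after level decoding. The function name and message wording are 
hypothetical, and std::runtime_error is only a stand-in for whatever exception 
type the library settles on.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical check: throws with a descriptive message when fewer levels
// were decoded than the data page claimed to contain.
// `kind` is a label for the message, e.g. "definition" or "repetition".
void CheckDecodedLevelCount(int expected, int decoded, const std::string& kind) {
  assert(expected >= 0 && decoded >= 0);
  if (decoded < expected) {
    std::ostringstream msg;
    msg << "Malformed data page: expected " << expected << " " << kind
        << " levels but decoded " << decoded;
    throw std::runtime_error(msg.str());
  }
}
```

A test for the malformed case would then feed a truncated level buffer to the 
decoder and assert that this exception, with its message, is raised.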





[jira] [Resolved] (PARQUET-450) Small typos/issues in parquet-format documentation

2016-01-29 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-450.
---
   Resolution: Fixed
Fix Version/s: format-2.4.0

Issue resolved by pull request 36
[https://github.com/apache/parquet-format/pull/36]

> Small typos/issues in parquet-format documentation
> --
>
> Key: PARQUET-450
> URL: https://issues.apache.org/jira/browse/PARQUET-450
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Laurent Goujon
>Assignee: Laurent Goujon
>Priority: Minor
> Fix For: format-2.4.0
>
>
> I noticed several typos/omissions in parquet format documentation:
> - HDFS should be all uppercase (acronym)
> - enncoding instead of encoding
> - markdown issues
> - no link to the thrift definition file
> - the integer format (LE vs BE) is not specified for the file metadata
> - the order of information in a data page





Re: Organizing functional components and a bottom-up testing plan for parquet-cpp

2016-01-29 Thread Julien Le Dem
Sounds good to me.
At some point (later) we'll have to do some cross-compatibility testing
with parquet-mr as well to make sure everything is on the same page.
CC'ing some folks who should probably chime in.


On Fri, Jan 29, 2016 at 10:21 AM, Wes McKinney wrote:

> hi folks,
>
> Since there are so many moving pieces in creating a full-featured Parquet
> reader-writer, I propose we start planning how to create test fixtures and
> tools that enable us to develop faster.
>
> Specifically, we need to achieve maximum decoupling between functional
> components. Every unit of functionality should be testable without having
> to create actual valid Parquet test data files. Smoke tests on real data
> will help, but it's a band-aid solution vs approaching the problem from a
> rigorous test-driven perspective.
>
> To assist with the discussion, let's address the different parts of the
> testing process
>
> - Functional unit testing of decoupled components. We need to make a
> diagram of all those boxes and what is their interface with each other. For
> example: a column decoder only needs to know how to ask for its next data
> page, but not where the data page is located physically.
>
> - Integration / macro-level testing, i.e. the "everything works together"
> part of the problem.
>
> I don't think investing in much top-down / integration testing of the
> library will help us (and may actually actively hurt us) until we organize
> the functional components of the library in a way that everything can be
> tested easily in isolation.
>
> I propose that we use a Google document to help with this design process
> and we can learn from parquet-mr and other implementations of Parquet to
> help move things along. In doing this we can cross-reference existing and
> new JIRAs so that it's clear exactly what needs to be done for each part of
> the system.
>
> Let me know your thoughts.
>
> thanks,
> Wes
>



-- 
Julien


[jira] [Updated] (PARQUET-438) Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests

2016-01-29 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-438:
-
Summary: Update RLE encoder/decoder modules from Impala upstream changes 
and adapt unit tests  (was: Adapt any relevant encoding and compression unit 
tests from Impala)

> Update RLE encoder/decoder modules from Impala upstream changes and adapt 
> unit tests
> 
>
> Key: PARQUET-438
> URL: https://issues.apache.org/jira/browse/PARQUET-438
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> Depends on PARQUET-437





[jira] [Commented] (PARQUET-438) Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests

2016-01-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124123#comment-15124123
 ] 

Wes McKinney commented on PARQUET-438:
--

See https://github.com/apache/parquet-cpp/pull/31

I'm addressing only the RLE encoding bits here. I looked into Impala's 
dictionary encoding facilities and it did not appear trivial to adapt that code 
here. I will investigate in follow-up JIRAs.

> Update RLE encoder/decoder modules from Impala upstream changes and adapt 
> unit tests
> 
>
> Key: PARQUET-438
> URL: https://issues.apache.org/jira/browse/PARQUET-438
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> Depends on PARQUET-437





[jira] [Updated] (PARQUET-462) Implement a LevelDecoder class (like Impala) which dispatches to RLE or BIT_PACKED decoding as appropriate

2016-01-29 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-462:
-
Summary: Implement a LevelDecoder class (like Impala) which dispatches to 
RLE or BIT_PACKED decoding as appropriate  (was: Create a new Level class for 
definition and repetition values)

> Implement a LevelDecoder class (like Impala) which dispatches to RLE or 
> BIT_PACKED decoding as appropriate
> --
>
> Key: PARQUET-462
> URL: https://issues.apache.org/jira/browse/PARQUET-462
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>
> This class extends the RleDecoder class.





[jira] [Resolved] (PARQUET-432) Complete a todo for method ColumnDescriptor.compareTo()

2016-01-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-432.

Resolution: Fixed

Issue resolved by pull request 314
[https://github.com/apache/parquet-mr/pull/314]

> Complete a todo for method ColumnDescriptor.compareTo()
> ---
>
> Key: PARQUET-432
> URL: https://issues.apache.org/jira/browse/PARQUET-432
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Minor
> Fix For: 1.9.0
>
>
> The ticket proposes handling the case *path.length < o.path.length* in the 
> method ColumnDescriptor.compareTo().





[jira] [Commented] (PARQUET-467) Check for and raise error for deprecated BIT_PACKED encoding

2016-01-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124182#comment-15124182
 ] 

Wes McKinney commented on PARQUET-467:
--

Per PARQUET-462 we can go ahead and implement this level encoding. I will leave 
this issue open until it's been tested sufficiently.

> Check for and raise error for deprecated BIT_PACKED encoding
> 
>
> Key: PARQUET-467
> URL: https://issues.apache.org/jira/browse/PARQUET-467
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
>
> This is implemented in Impala, but it is unclear how much data in the wild is 
> encoded in this format (deprecated according to parquet-format), with RLE as 
> the preferred encoding for repetition/definition levels. At minimum we 
> should raise an exception if this encoding is encountered.





[jira] [Commented] (PARQUET-438) Update RLE encoder/decoder modules from Impala upstream changes and adapt unit tests

2016-01-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124173#comment-15124173
 ] 

Wes McKinney commented on PARQUET-438:
--

[~mdeepak] If you identify a specific problem with the Impala RLE 
encoding/decoding code, I'm sure the team will be very keen to hear about it. 

> Update RLE encoder/decoder modules from Impala upstream changes and adapt 
> unit tests
> 
>
> Key: PARQUET-438
> URL: https://issues.apache.org/jira/browse/PARQUET-438
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> Depends on PARQUET-437


