[jira] [Commented] (PARQUET-514) Automate coveralls.io updates in Travis CI

2016-02-18 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153679#comment-15153679
 ] 

Wes McKinney commented on PARQUET-514:
--

see patch https://github.com/apache/parquet-cpp/pull/57

> Automate coveralls.io updates in Travis CI
> --
>
> Key: PARQUET-514
> URL: https://issues.apache.org/jira/browse/PARQUET-514
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
>
> The repo has been enabled in INFRA-11273, so all that's left is to work on 
> the Travis CI build matrix and add coveralls to one of the builds (rather 
> than running it for all of them)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-470) Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux

2016-02-18 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-470.
--
Resolution: Fixed

Resolved in PARQUET-468

> Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux
> -
>
> Key: PARQUET-470
> URL: https://issues.apache.org/jira/browse/PARQUET-470
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> Thrift 0.9.3 introduces a {{#include }} include which 
> causes {{tr1/functional}} to be included, causing a compiler conflict with 
> googletest, which has its own portability macros surrounding its use of 
> {{std::tr1::tuple}}. I spent a bunch of time twiddling compiler flags to try 
> to resolve this conflict, but wasn't able to figure it out. 
> If this is a Thrift bug, we should report it to Thrift. If it's fixable by 
> compiler flags, then we should figure that out and track the issue here, 
> otherwise users with the latest version of Thrift will be unable to compile 
> the parquet-cpp test suite.
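
For anyone hitting the same clash, here is a minimal sketch of the macro-level workaround sometimes used: tell googletest not to rely on {{std::tr1::tuple}} at all, so its portability macros cannot collide with the tr1 headers that Thrift 0.9.3 pulls in. {{GTEST_HAS_TR1_TUPLE}} and {{GTEST_USE_OWN_TR1_TUPLE}} are googletest's own macros; the Thrift header below merely stands in for whichever include triggers the tr1 pull-in, and whether this actually sidesteps the conflict here is exactly what the report leaves open (the issue was ultimately resolved via PARQUET-468).

{code}
// Sketch only: force googletest to avoid std::tr1::tuple before it is included.
#define GTEST_HAS_TR1_TUPLE 0
#define GTEST_USE_OWN_TR1_TUPLE 0
#include <gtest/gtest.h>
// Any Thrift 0.9.3 header that transitively reaches the tr1 machinery; this one
// is just a stand-in, the report does not name the offending include.
#include <thrift/protocol/TCompactProtocol.h>

TEST(ThriftGtestConflict, CompilesTogether) {
  // Compiling this translation unit at all is the point of the exercise.
  SUCCEED();
}

int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}
{code}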



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-531) Can't read past first page in a column

2016-02-18 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153527#comment-15153527
 ] 

Deepak Majeti commented on PARQUET-531:
---

I will work with [~asandryh] and try to push them by tomorrow or this weekend. 

> Can't read past first page in a column
> --
>
> Key: PARQUET-531
> URL: https://issues.apache.org/jira/browse/PARQUET-531
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu Linux 14.04 (no obvious platform dependence), 
> Parquet file created by Apache Spark 1.5.0 on the same platform. 
>Reporter: Spiro Michaylov
>Assignee: Deepak Majeti
> Attachments: 
> part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2016 and adding the obvious three lines of code 
> to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>  case parquet::CompressionCodec::GZIP:
>decompressor_.reset(new GZipCodec());
>break;
> {code}
> I try to run the parquet_reader example on the column I'm about to attach, 
> which was created by Apache Spark 1.5.0. It works surprisingly well until it 
> hits the end of the first page, where it dies with  
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support 
> is new and (b) I had to modify the code to enable it, but actually things 
> seem to decompress just fine (congratulations: this is awesome!): looking at 
> the problem in the debugger and tracing through a bit it seems to me like the 
> buffering is a bit screwed up in general -- some kind of confusion between 
> the buffering at the Scanner and Reader levels. I can reproduce the problem 
> by reading through just a single column too. 
> It fails after 128 rows, which is suspicious given this line in 
> column/scanner.h:
> {code}
> DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}
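
To make the 128-row coincidence concrete, here is a self-contained toy with hypothetical names (not the parquet-cpp API): with a batch size of 128, value number 128 is the first one after the initial fill that forces the scanner back to the column reader, so any bug in that refill / hand-off path would surface at exactly the row count reported above.

{code}
#include <cstddef>
#include <cstdio>
#include <vector>

// Toy scanner, illustrative only. kBatchSize mirrors DEFAULT_SCANNER_BATCH_SIZE.
constexpr int kBatchSize = 128;

struct ToyScanner {
  std::vector<int> batch;
  std::size_t pos = 0;
  int next_from_reader = 0;  // stands in for values pulled from the column reader

  int Next() {
    // Refill boundary: after the initial fill, this is hit again at value 128.
    if (pos == batch.size()) {
      batch.clear();
      for (int i = 0; i < kBatchSize; ++i) batch.push_back(next_from_reader++);
      pos = 0;
    }
    return batch[pos++];
  }
};

int main() {
  ToyScanner scanner;
  for (int row = 0; row < 200; ++row) {
    if (scanner.Next() != row) {
      std::printf("mismatch at row %d\n", row);
      return 1;
    }
  }
  std::printf("200 values scanned across the batch boundary\n");
  return 0;
}
{code}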



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-531) Can't read past first page in a column

2016-02-18 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153506#comment-15153506
 ] 

Wes McKinney commented on PARQUET-531:
--

Do you have a time estimate for these patches (for my own planning)?

> Can't read past first page in a column
> --
>
> Key: PARQUET-531
> URL: https://issues.apache.org/jira/browse/PARQUET-531
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu Linux 14.04 (no obvious platform dependence), 
> Parquet file created by Apache Spark 1.5.0 on the same platform. 
>Reporter: Spiro Michaylov
>Assignee: Deepak Majeti
> Attachments: 
> part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2016 and adding the obvious three lines of code 
> to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>  case parquet::CompressionCodec::GZIP:
>decompressor_.reset(new GZipCodec());
>break;
> {code}
> I try to run the parquet_reader example on the column I'm about to attach, 
> which was created by Apache Spark 1.5.0. It works surprisingly well until it 
> hits the end of the first page, where it dies with  
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support 
> is new and (b) I had to modify the code to enable it, but actually things 
> seem to decompress just fine (congratulations: this is awesome!): looking at 
> the problem in the debugger and tracing through a bit it seems to me like the 
> buffering is a bit screwed up in general -- some kind of confusion between 
> the buffering at the Scanner and Reader levels. I can reproduce the problem 
> by reading through just a single column too. 
> It fails after 128 rows, which is suspicious given this line in 
> column/scanner.h:
> {code}
> DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-531) Can't read past first page in a column

2016-02-18 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153496#comment-15153496
 ] 

Deepak Majeti commented on PARQUET-531:
---

This will be resolved by the upcoming patches for PARQUET-526 and PARQUET-532

> Can't read past first page in a column
> --
>
> Key: PARQUET-531
> URL: https://issues.apache.org/jira/browse/PARQUET-531
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu Linux 14.04 (no obvious platform dependence), 
> Parquet file created by Apache Spark 1.5.0 on the same platform. 
>Reporter: Spiro Michaylov
>Assignee: Deepak Majeti
> Attachments: 
> part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2016 and adding the obvious three lines of code 
> to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>  case parquet::CompressionCodec::GZIP:
>decompressor_.reset(new GZipCodec());
>break;
> {code}
> I try to run the parquet_reader example on the column I'm about to attach, 
> which was created by Apache Spark 1.5.0. It works surprisingly well until it 
> hits the end of the first page, where it dies with  
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support 
> is new and (b) I had to modify the code to enable it, but actually things 
> seem to decompress just fine (congratulations: this is awesome!): looking at 
> the problem in the debugger and tracing through a bit it seems to me like the 
> buffering is a bit screwed up in general -- some kind of confusion between 
> the buffering at the Scanner and Reader levels. I can reproduce the problem 
> by reading through just a single column too. 
> It fails after 128 rows, which is suspicious given this line in 
> column/scanner.h:
> {code}
> DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-471) Use the same environment setup script for Travis CI as local sandbox development

2016-02-18 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-471.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 54
[https://github.com/apache/parquet-cpp/pull/54]

> Use the same environment setup script for Travis CI as local sandbox 
> development
> 
>
> Key: PARQUET-471
> URL: https://issues.apache.org/jira/browse/PARQUET-471
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: cpp-0.1
>
>
> Currently the environment setups are slightly different, and so a passing 
> Travis CI build might have a problem with the sandbox build and vice versa.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-499) Complete PlainEncoder implementation for all primitive types and test end to end

2016-02-18 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-499.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

resolved by:
https://github.com/apache/parquet-cpp/pull/52

> Complete PlainEncoder implementation for all primitive types and test end to 
> end
> 
>
> Key: PARQUET-499
> URL: https://issues.apache.org/jira/browse/PARQUET-499
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
> Fix For: cpp-0.1
>
>
> As part of PARQUET-485, I added a partial {{Encoding::PLAIN}} encoder 
> implementation. This needs to be finished, with a test suite that validates 
> data round-trips across all primitive types. 
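
For the fixed-width primitive types, {{Encoding::PLAIN}} is simply the little-endian values laid out back to back, so the round-trip property the requested test suite needs to check can be sketched with a standalone stand-in. The helper names below are illustrative, not the parquet-cpp Encoder/Decoder API, and the sketch assumes a little-endian host so that a raw copy matches the on-disk layout.

{code}
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for PLAIN encoding of a fixed-width type: raw values, back to back.
template <typename T>
std::vector<uint8_t> PlainEncode(const std::vector<T>& values) {
  std::vector<uint8_t> out(values.size() * sizeof(T));
  std::memcpy(out.data(), values.data(), out.size());
  return out;
}

template <typename T>
std::vector<T> PlainDecode(const std::vector<uint8_t>& buf, std::size_t count) {
  std::vector<T> out(count);
  std::memcpy(out.data(), buf.data(), count * sizeof(T));
  return out;
}

int main() {
  // The property the end-to-end tests should establish for every primitive
  // type: decode(encode(x)) == x.
  std::vector<int32_t> original = {0, -1, 42, 1 << 20};
  std::vector<int32_t> decoded =
      PlainDecode<int32_t>(PlainEncode(original), original.size());
  assert(decoded == original);
  return 0;
}
{code}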



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-516) Add better error handling for reading local files

2016-02-18 Thread Aliaksei Sandryhaila (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153264#comment-15153264
 ] 

Aliaksei Sandryhaila commented on PARQUET-516:
--

Initial PR is available: https://github.com/apache/parquet-cpp/pull/56

> Add better error handling for reading local files
> -
>
> Key: PARQUET-516
> URL: https://issues.apache.org/jira/browse/PARQUET-516
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Aliaksei Sandryhaila
>Priority: Minor
>
> The {{LocalFile}} reader class does not handle the various failure modes for 
> the cstdio system calls.
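
A minimal sketch of the kind of checking being asked for: every cstdio call used by a local-file reader has a failure mode that should surface as an error instead of being silently ignored. The helper below is illustrative, not the actual {{LocalFile}} implementation, and {{std::runtime_error}} stands in for the library's exception type.

{code}
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Read up to nbytes from path into out, checking every cstdio failure mode.
std::size_t CheckedRead(const std::string& path, void* out, std::size_t nbytes) {
  std::FILE* file = std::fopen(path.c_str(), "rb");
  if (file == nullptr) {
    throw std::runtime_error("Failed to open " + path + ": " + std::strerror(errno));
  }
  std::size_t bytes_read = std::fread(out, 1, nbytes, file);
  if (bytes_read < nbytes && std::ferror(file) != 0) {
    std::fclose(file);
    throw std::runtime_error("Read error on " + path);
  }
  if (std::fclose(file) != 0) {
    throw std::runtime_error("Failed to close " + path);
  }
  return bytes_read;  // may be short if EOF was reached before nbytes
}
{code}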



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-478) Reassembly algorithms for Arrow in-memory columnar memory layout

2016-02-18 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-478:
-
Description: 
I plan to use parquet-cpp primarily in conjunction with columnar data 
structures (http://arrow.apache.org). 

Specifically, when interpreting repetition / definition levels, this requires:

* Computing null bits / bytes for each logical level of nested tree (group, 
array, primitive leaf)
* Computing implied array sizes for each repeated group (according to 1, 2, or 
3-level array encoding)

The results of this reconstruction will be simply C arrays accompanied by the 
parquet-cpp logical schema; this way we can make it easy to adapt to different 
in-memory columnar memory schemes. 

As far as implementation, it would make sense to proceed first with functional 
unit tests of the reassembly algorithms using repetition / definition levels 
declared in the test suite as C++ vectors -- otherwise it's going to be too 
tedious trying to produce valid Parquet test data files which explore all of 
the different edge cases.

Several other teams (Spark, Drill, Parquet-Java) are currently working on 
related efforts along these lines, so we can engage when appropriate to 
collaborate on algorithms and nuances of this approach to avoid unnecessary 
code churn / bugs. 

  was:
I plan to use parquet-cpp primarily in conjunction with columnar data 
structures. 

Specifically, when interpreting repetition / definition levels, this requires:

* Computing null bits / bytes for each logical level of nested tree (group, 
array, primitive leaf)
* Computing implied array sizes for each repeated group (according to 1, 2, or 
3-level array encoding)

The results of this reconstruction will be simply C arrays accompanied by the 
parquet-cpp logical schema; this way we can make it easy to adapt to different 
in-memory columnar memory schemes. 

As far as implementation, it would make sense to proceed first with functional 
unit tests of the reassembly algorithms using repetition / definition levels 
declared in the test suite as C++ vectors -- otherwise it's going to be too 
tedious trying to produce valid Parquet test data files which explore all of 
the different edge cases.

Several other teams (Spark, Drill, Parquet-Java) are currently working on 
related efforts along these lines, so we can engage when appropriate to 
collaborate on algorithms and nuances of this approach to avoid unnecessary 
code churn / bugs. 

Summary: Reassembly algorithms for Arrow in-memory columnar memory 
layout  (was: Reassembly algorithms for nested in-memory columnar memory layout)

> Reassembly algorithms for Arrow in-memory columnar memory layout
> 
>
> Key: PARQUET-478
> URL: https://issues.apache.org/jira/browse/PARQUET-478
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> I plan to use parquet-cpp primarily in conjunction with columnar data 
> structures (http://arrow.apache.org). 
> Specifically, when interpreting repetition / definition levels, this requires:
> * Computing null bits / bytes for each logical level of nested tree (group, 
> array, primitive leaf)
> * Computing implied array sizes for each repeated group (according to 1, 2, 
> or 3-level array encoding)
> The results of this reconstruction will be simply C arrays accompanied by the 
> parquet-cpp logical schema; this way we can make it easy to adapt to 
> different in-memory columnar memory schemes. 
> As far as implementation, it would make sense to proceed first with 
> functional unit tests of the reassembly algorithms using repetition / 
> definition levels declared in the test suite as C++ vectors -- otherwise it's 
> going to be too tedious trying to produce valid Parquet test data files which 
> explore all of the different edge cases.
> Several other teams (Spark, Drill, Parquet-Java) are currently working on 
> related efforts along these lines, so we can engage when appropriate to 
> collaborate on algorithms and nuances of this approach to avoid unnecessary 
> code churn / bugs. 
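
For the simplest case only -- a flat optional (non-repeated) leaf, where the maximum definition level is 1 -- the first bullet above reduces to a one-pass scan over the definition levels; the nested and repeated cases this issue is really about require the tree-aware reassembly described in the issue text. A minimal sketch with illustrative names (not parquet-cpp or Arrow APIs):

{code}
#include <cstddef>
#include <cstdint>
#include <vector>

// One null byte per value: 1 if the value is null, 0 otherwise. For a flat
// optional leaf, a value is null exactly when its definition level is below
// the column's maximum definition level.
std::vector<uint8_t> NullBytesFromDefLevels(const std::vector<int16_t>& def_levels,
                                            int16_t max_definition_level) {
  std::vector<uint8_t> is_null(def_levels.size());
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    is_null[i] = def_levels[i] < max_definition_level ? 1 : 0;
  }
  return is_null;
}
{code}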



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-537) LocalFileSource leaks resources

2016-02-18 Thread Aliaksei Sandryhaila (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153255#comment-15153255
 ] 

Aliaksei Sandryhaila commented on PARQUET-537:
--

Yes, that's what it looks like. It's strange, since {{LocalFileSource}} should 
be taken care of during the destruction of the {{unique_ptr}} holding it. I'll 
post a PR shortly.

> LocalFileSource leaks resources
> ---
>
> Key: PARQUET-537
> URL: https://issues.apache.org/jira/browse/PARQUET-537
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-0.1
>Reporter: Aliaksei Sandryhaila
>
> As a result of modifications introduced in PARQUET-497, LocalFileSource never 
> gets deleted and the associated memory and file handle are leaked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PARQUET-470) Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux

2016-02-18 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-470:


Assignee: Wes McKinney

> Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux
> -
>
> Key: PARQUET-470
> URL: https://issues.apache.org/jira/browse/PARQUET-470
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> Thrift 0.9.3 introduces a {{#include }} include which 
> causes {{tr1/functional}} to be included, causing a compiler conflict with 
> googletest, which has its own portability macros surrounding its use of 
> {{std::tr1::tuple}}. I spent a bunch of time twiddling compiler flags to try 
> to resolve this conflict, but wasn't able to figure it out. 
> If this is a Thrift bug, we should report it to Thrift. If it's fixable by 
> compiler flags, then we should figure that out and track the issue here, 
> otherwise users with the latest version of Thrift will be unable to compile 
> the parquet-cpp test suite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-470) Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux

2016-02-18 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153252#comment-15153252
 ] 

Wes McKinney commented on PARQUET-470:
--

this is fixed in https://github.com/apache/parquet-cpp/pull/55

> Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux
> -
>
> Key: PARQUET-470
> URL: https://issues.apache.org/jira/browse/PARQUET-470
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> Thrift 0.9.3 introduces a {{#include }} include which 
> causes {{tr1/functional}} to be included, causing a compiler conflict with 
> googletest, which has its own portability macros surrounding its use of 
> {{std::tr1::tuple}}. I spent a bunch of time twiddling compiler flags to try 
> to resolve this conflict, but wasn't able to figure it out. 
> If this is a Thrift bug, we should report it to Thrift. If it's fixable by 
> compiler flags, then we should figure that out and track the issue here, 
> otherwise users with the latest version of Thrift will be unable to compile 
> the parquet-cpp test suite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-537) LocalFileSource leaks resources

2016-02-18 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153246#comment-15153246
 ] 

Wes McKinney commented on PARQUET-537:
--

Could you clarify how to reproduce this problem? The file's lifetime is 
currently tied to the {{ParquetFileReader}} -- are you saying that when 
{{ParquetFileReader}} is deleted, {{LocalFileSource}}'s virtual dtor is not 
called? 
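
A self-contained sketch of the failure mode being discussed, with illustrative names rather than the parquet-cpp classes: if the owner holds the source through a base-class pointer and the base destructor is not virtual, the derived destructor -- and with it the {{fclose}} -- is never reached when the owner is destroyed, even though the {{unique_ptr}} itself is released correctly.

{code}
#include <cstdio>
#include <memory>

struct Source {
  // Without `virtual` here, deleting a LocalSource through a Source* (as the
  // unique_ptr below does) would skip ~LocalSource and leak the FILE handle.
  virtual ~Source() = default;
};

struct LocalSource : Source {
  explicit LocalSource(const char* path) : file_(std::fopen(path, "rb")) {}
  ~LocalSource() override {
    if (file_ != nullptr) std::fclose(file_);  // the cleanup that must be reached
  }
  std::FILE* file_;
};

struct Reader {
  std::unique_ptr<Source> source_;  // destroying Reader should close the file
};

int main() {
  Reader reader;
  reader.source_.reset(new LocalSource("/etc/hostname"));  // any readable path
  return 0;  // ~Reader -> ~unique_ptr -> virtual ~Source -> ~LocalSource -> fclose
}
{code}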

> LocalFileSource leaks resources
> ---
>
> Key: PARQUET-537
> URL: https://issues.apache.org/jira/browse/PARQUET-537
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-0.1
>Reporter: Aliaksei Sandryhaila
>
> As a result of modifications introduced in PARQUET-497, LocalFileSource never 
> gets deleted and the associated memory and file handle are leaked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-537) LocalFileSource leaks resources

2016-02-18 Thread Aliaksei Sandryhaila (JIRA)
Aliaksei Sandryhaila created PARQUET-537:


 Summary: LocalFileSource leaks resources
 Key: PARQUET-537
 URL: https://issues.apache.org/jira/browse/PARQUET-537
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Affects Versions: cpp-0.1
Reporter: Aliaksei Sandryhaila


As a result of modifications introduced in PARQUET-497, LocalFileSource never 
gets deleted and the associated memory and file handle are leaked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-471) Use the same environment setup script for Travis CI as local sandbox development

2016-02-18 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153049#comment-15153049
 ] 

Wes McKinney commented on PARQUET-471:
--

See patch https://github.com/apache/parquet-cpp/pull/54

> Use the same environment setup script for Travis CI as local sandbox 
> development
> 
>
> Key: PARQUET-471
> URL: https://issues.apache.org/jira/browse/PARQUET-471
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> Currently the environment setups are slightly different, and so a passing 
> Travis CI build might have a problem with the sandbox build and vice versa.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PARQUET-471) Use the same environment setup script for Travis CI as local sandbox development

2016-02-18 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-471:


Assignee: Wes McKinney

> Use the same environment setup script for Travis CI as local sandbox 
> development
> 
>
> Key: PARQUET-471
> URL: https://issues.apache.org/jira/browse/PARQUET-471
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> Currently the environment setups are slightly different, and so a passing 
> Travis CI build might have a problem with the sandbox build and vice versa.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-536) Configure Travis CI caching to preserve built thirdparty in between builds

2016-02-18 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-536:


 Summary: Configure Travis CI caching to preserve built thirdparty 
in between builds
 Key: PARQUET-536
 URL: https://issues.apache.org/jira/browse/PARQUET-536
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney


Follow-up to PARQUET-471. Will speed up builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: HashJoin throws ParquetDecodingException with input as ParquetTupleScheme

2016-02-18 Thread Ryan Blue
Santlal,

What version of Parquet are you using? I think this was recently fixed by
Reuben.

rb

On Tue, Feb 16, 2016 at 5:16 AM, Santlal J Gupta <
santlal.gu...@bitwiseglobal.com> wrote:

> Hi,
>
> I am facing a problem while using *HashJoin* with input read through
> *ParquetTupleScheme*. I have two source taps: one uses the *TextDelimited*
> scheme and the other uses *ParquetTupleScheme*. I am performing a *HashJoin*
> and writing the data out as a delimited file. The program runs successfully
> in local mode, but when I try to run it on a cluster it gives the following
> error:
>
> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1
> in file hdfs://Hostname:8020/user/username/testData/lookup-file.parquet
> at
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:211)
> at
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144)
> at
> parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:91)
> at
> parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:42)
> at
> cascading.tap.hadoop.io.MultiRecordReaderIterator.makeReader(MultiRecordReaderIterator.java:123)
> at
> cascading.tap.hadoop.io.MultiRecordReaderIterator.getNextReader(MultiRecordReaderIterator.java:172)
> at
> cascading.tap.hadoop.io.MultiRecordReaderIterator.hasNext(MultiRecordReaderIterator.java:133)
> at
> cascading.tuple.TupleEntrySchemeIterator.<init>(TupleEntrySchemeIterator.java:94)
> at
> cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:49)
> at
> cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:44)
> at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:439)
> at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:108)
> at
> cascading.flow.stream.element.SourceStage.map(SourceStage.java:82)
> at
> cascading.flow.stream.element.SourceStage.run(SourceStage.java:66)
> at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:139)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
> at
> parquet.hadoop.util.counters.mapred.MapRedCounterAdapter.increment(MapRedCounterAdapter.java:34)
> at
> parquet.hadoop.util.counters.BenchmarkCounter.incrementTotalBytes(BenchmarkCounter.java:75)
> at
> parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:349)
> at
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:114)
> at
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:191)
> ... 21 more
>
> *Below is the use case:*
>
> public static void main(String[] args) throws IOException {
>
> Configuration conf = new Configuration();
>
> String[] otherArgs;
>
> otherArgs = new GenericOptionsParser(conf,
> args).getRemainingArgs();
>
> String argsString = "";
> for (String arg : otherArgs) {
> argsString = argsString + " " + arg;
> }
> System.out.println("After processing arguments are:" + argsString);
>
> Properties properties = new Properties();
> properties.putAll(conf.getValByRegex(".*"));
>
> String OutputPath = "testData/BasicEx_Output";
> Class types1[] = { String.class, String.class, String.class };
> Fields f1 = new Fields("id1", "city1", "state");
>
> Tap source = new Hfs(new TextDelimited(f1, "|", "", types1,
> false), "main-txt-file.dat");
> Pipe pipe = new Pipe("ReadWrite");
>
> Scheme pScheme = new ParquetTupleScheme();
> Tap source2 = new Hfs(pScheme, "testData/lookup-file.parquet");
> Pipe pipe2 = new Pipe("ReadWrite2");
>
> Pipe tokenPipe = new HashJoin(pipe, new Fields("id1"), pipe2, new
> Fields("id"), new LeftJoin());
>
> Tap sink = new Hfs(new TextDelimited(f1, true, "|"), OutputPath,
> SinkMode.REPLACE);
>
> FlowDef flowDef1 = FlowDef.flowDef().addSource(pipe,
> source).addSource(pipe2, source2).addTailSink(tokenPipe,
> sink);
> new
> Hadoop2MR1FlowConnector(properties).connect(flowDef1).complete();
>
> }
>
>
> I have attached the input files for reference. Please help me resolve this
> issue.
>
>
>
> I have