[jira] [Updated] (PARQUET-346) ThriftSchemaConverter throws for unknown struct or union type

2015-12-14 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-346:
------------------------------
    Fix Version/s: (was: 2.0.0)
                   1.9.0

> ThriftSchemaConverter throws for unknown struct or union type
> -------------------------------------------------------------
>
> Key: PARQUET-346
> URL: https://issues.apache.org/jira/browse/PARQUET-346
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Alex Levenson
>Assignee: Alex Levenson
> Fix For: 1.9.0
>
>
> ThriftSchemaConverter should either be called only on ThriftStructs that
> have populated structOrUnionType metadata, or should support a mode where
> this data is unknown without throwing an exception.
> Currently it is called using the file's metadata here:
> https://github.com/apache/parquet-mr/blob/d6f082b9be5d507ff60c6bc83a179cc44015ab97/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftRecordConverter.java#L797
> One workaround is to not use the file metadata here but rather the schema
> from the thrift class. The other is to support unknown struct or union types.
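
A sketch of the first workaround, for illustration only (it assumes parquet-thrift's ThriftSchemaConverter API; the {{toStructType}} call and the variable names are assumptions, not the committed fix):

{code}
// Hypothetical sketch: derive the descriptor from the thrift class itself,
// where structOrUnionType metadata is always populated, instead of trusting
// the (possibly UNKNOWN) descriptor parsed from the file metadata.
ThriftSchemaConverter converter = new ThriftSchemaConverter();
StructType descriptor = converter.toStructType(thriftClass);
{code}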



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Java 7

2015-12-14 Thread Jason Altekruse
+1 to a move; the timeline seems reasonable to me. I agree we should find out
more information from users next year before actually making the move to 8,
but this is a good target.

On Mon, Dec 14, 2015 at 10:23 AM, Ryan Blue  wrote:

> +1 for moving to Java 7 now.
>
> For Java 8, I think Dec 2016 sounds reasonable given the end of public
> updates in April. But I think we should make that a goal and not a
> commitment. Java 8 code can't be run on a Java 7 VM [1] and it may take
> people a while to migrate even after free support runs out.
>
> I think it makes sense to see if we know about anyone running Java 7 mid
> next year.
>
> rb
>
>
> [1]:
> https://stackoverflow.com/questions/16143684/can-java-8-code-be-compiled-to-run-on-java-7-jvm
>
>
> On 12/12/2015 03:47 PM, Julien Le Dem wrote:
>
>> As a library, Parquet has to move to new versions of Java later than
>> applications do, to minimize pain for its users.
>> However, it looks like it is high time to move to Java 7:
>> http://www.oracle.com/technetwork/java/eol-135779.html
>>
>> And we should probably define a deadline for Java 7 support before moving
>> to Java 8.
>>
>> 2 questions:
>>   - any objection to moving Parquet to Java 7?
>>   - let's define a date for moving to Java 8 so that users have time to
>> plan ahead or argue against. I propose Dec 2016. Any other proposal?
>>
>> Related PR:
>> https://github.com/apache/parquet-mr/pull/231
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>


[jira] [Created] (PARQUET-407) Incorrect delta-encoding example

2015-12-14 Thread choi woo cheol (JIRA)
choi woo cheol created PARQUET-407:
----------------------------------

 Summary: Incorrect delta-encoding example
 Key: PARQUET-407
 URL: https://issues.apache.org/jira/browse/PARQUET-407
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: choi woo cheol
Priority: Trivial


The minimum delta and the packed bits are incorrect in delta-encoding Example 2
in {{Encodings.md}}.

In the example:

{code}
Example 2

7, 5, 3, 1, 2, 3, 4, 5, the deltas would be

-2, -2, -2, 1, 1, 1, 1
The minimum is -2, so the relative deltas are:

0, 0, 0, 3, 3, 3, 3

The encoded data is

header: 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)

block 0 (minimum delta), 2 (bitwidth), 000000111111b (0,0,0,3,3,3 packed on 2 bits)
{code}


The minimum is -2 and the relative deltas are 0, 0, 0, 3, 3, 3, 3. So, this 
should be corrected as below:

{code}
block -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)
{code}
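
For illustration, a small sketch that re-derives the numbers in the corrected example (plain Java; it reproduces the bit string as written in {{Encodings.md}}, most-significant bit first, and is not the on-disk byte layout of DELTA_BINARY_PACKED):

{code}
int[] values = {7, 5, 3, 1, 2, 3, 4, 5};
int[] deltas = new int[values.length - 1];
int min = Integer.MAX_VALUE;
for (int i = 1; i < values.length; i++) {
    deltas[i - 1] = values[i] - values[i - 1];   // -2, -2, -2, 1, 1, 1, 1
    min = Math.min(min, deltas[i - 1]);          // minimum delta: -2
}
StringBuilder bits = new StringBuilder();
for (int d : deltas) {
    int rel = d - min;                           // relative deltas: 0, 0, 0, 3, 3, 3, 3
    bits.append((rel >> 1) & 1).append(rel & 1); // each value packed on 2 bits
}
System.out.println(min + ", " + bits + "b");     // prints: -2, 00000011111111b
{code}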



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Parquet sync up

2015-12-14 Thread Daniel Weeks
Works for me as well.

On Mon, Dec 14, 2015 at 12:40 PM, Reuben Kuhnert <reuben.kuhn...@cloudera.com> wrote:

> I can make that. Thanks
>
> On Mon, Dec 14, 2015 at 12:27 PM, Ryan Blue  wrote:
>
> > Works for me.
> >
> >
> > On 12/12/2015 03:20 PM, Julien Le Dem wrote:
> >
> >> The next parquet sync up is scheduled for next week Wednesday at 10 am PT.
> >> Any objection to moving it to Thursday at the same time? I have a conflict.
> >>
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>


[jira] [Commented] (PARQUET-405) Backwards-incompatible change to thrift metadata

2015-12-14 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057013#comment-15057013
 ] 

Ryan Blue commented on PARQUET-405:
-----------------------------------

Thanks, Ben! Both for reporting the issue and for helping us keep the issues 
organized.

> Backwards-incompatible change to thrift metadata
> ------------------------------------------------
>
> Key: PARQUET-405
> URL: https://issues.apache.org/jira/browse/PARQUET-405
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Ben Kirwin
>
> Sometime in the last few versions, an {{isStructOrUnion}} field has been added
> to the `thrift.descriptor` written to the parquet header:
> {code}
> {
> "children": [ ... ],
> "id": "STRUCT", 
> "structOrUnionType": "STRUCT"
> }
> {code}
> The current release now throws an exception when that field is missing (or
> {{UNKNOWN}}). This makes it impossible to read back thrift data written using
> a previous release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: parquet file doubts

2015-12-14 Thread Cheng Lian
Actually, adding a single line below this line
should make parquet-meta print min/max statistics:


if (!meta.getStatistics().isEmpty()) out.format(" STA:[%s]", meta.getStatistics().toString());


Cheng

On 12/14/15 8:42 PM, Shushant Arora wrote:

Hi

Do you have any sample program in java to validate/read min max of 
column groups in Parquet file?


Thanks

On Tue, Dec 8, 2015 at 2:50 PM, Cheng Lian wrote:


Cc'd Parquet dev list. At first I expected to discuss this issue
on Parquet dev list but sent to the wrong mailing list. However, I
think it's OK to discuss it here since lots of Spark users are
using Parquet and this information should be generally useful here.

Comments inlined.

On 12/7/15 10:34 PM, Shushant Arora wrote:

how to read it using parquet-tools?
When I did
hadoop parquet.tools.Main meta parquetfilename

I didn't get any info about min and max values.

I didn't realize that you meant to inspect min/max values, since what
you asked was how to inspect the version of the Parquet library that
was used to generate the Parquet file.

Currently parquet-tools doesn't print min/max statistics
information. I'm afraid you'll have to do it programmatically.
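
A minimal sketch of that programmatic route, assuming parquet-mr 1.8.x-era
APIs (readFooter and the metadata classes below; treat the exact signatures
as assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class PrintStats {
    public static void main(String[] args) throws Exception {
        // Reads only the footer; row group data is not scanned.
        ParquetMetadata footer =
            ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
        for (BlockMetaData block : footer.getBlocks()) {
            for (ColumnChunkMetaData column : block.getColumns()) {
                // Statistics#toString() includes min, max, and null count.
                System.out.println(column.getPath() + ": " + column.getStatistics());
            }
        }
    }
}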

How can I see parquet version of my file.Is min max respective to
some parquet version or available since beginning?

AFAIK, it was added in 1.5.0

https://github.com/apache/parquet-mr/blob/parquet-1.5.0/parquet-column/src/main/java/parquet/column/statistics/Statistics.java

But I failed to find the corresponding JIRA ticket or pull request for
this.




On Mon, Dec 7, 2015 at 6:51 PM, Singh, Abhijeet wrote:

Yes, Parquet has min/max.

*From:* Cheng Lian [mailto:l...@databricks.com]
*Sent:* Monday, December 07, 2015 11:21 AM
*To:* Ted Yu
*Cc:* Shushant Arora; u...@spark.apache.org

*Subject:* Re: parquet file doubts

Oh sorry... At first I meant to cc spark-user list since
Shushant and I had discussed some Spark-related issues
before. Then I realized that this is a pure Parquet issue,
but forgot to change the cc list. Thanks for pointing this
out! Please ignore this thread.

Cheng

On 12/7/15 12:43 PM, Ted Yu wrote:

Cheng:

I only see user@spark in the CC.

FYI

On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian wrote:

cc parquet-dev list (it would be nice to always do so for
these general questions.)

Cheng

On 12/6/15 3:10 PM, Shushant Arora wrote:

Hi

I have few doubts on parquet file format.

1. Does parquet keep min/max statistics like ORC does? How
can I see the parquet version (whether it's 1.1, 1.2, or 1.3) for a
parquet file generated using Hive, custom MR, or
AvroParquetOutputFormat?

Yes, Parquet also keeps row group statistics. You may
check the Parquet file using the parquet-meta CLI tool in
parquet-tools (see
https://github.com/Parquet/parquet-mr/issues/321 for
details), then look for the "creator" field of the file.
For programmatic access, check for
o.a.p.hadoop.metadata.FileMetaData.createdBy.
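
As a sketch of that programmatic access (same footer API as above; exact
signatures are assumptions based on parquet-mr 1.8.x):

// Read the writer version string from the footer.
ParquetMetadata footer =
    ParquetFileReader.readFooter(new Configuration(), new Path("file.parquet"));
System.out.println(footer.getFileMetaData().getCreatedBy());
// e.g. "parquet-mr version 1.8.0 (build ...)"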


2. How do I sort parquet records while generating a parquet
file using AvroParquetOutputFormat?

AvroParquetOutputFormat is not a format. It's just
responsible for converting Avro records to Parquet
records. How are you using AvroParquetOutputFormat? Any
example snippets?


Thanks




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org










[jira] [Created] (PARQUET-405) Backwards-incompatible change to thrift metadata

2015-12-14 Thread Ben Kirwin (JIRA)
Ben Kirwin created PARQUET-405:
-------------------------------

 Summary: Backwards-incompatible change to thrift metadata
 Key: PARQUET-405
 URL: https://issues.apache.org/jira/browse/PARQUET-405
 Project: Parquet
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Ben Kirwin


Sometime in the last few versions, an {{isStructOrUnion}} field has been added
to the `thrift.descriptor` written to the parquet header:

{code}
{
"children": [ ... ],
"id": "STRUCT", 
"structOrUnionType": "STRUCT"
}
{code}

The current release now throws an exception when that field is missing (or
{{UNKNOWN}}). This makes it impossible to read back thrift data written using a
previous release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-404) Add a note to dev/README.md that mentions that an SSH public key on the committer github account is needed

2015-12-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PARQUET-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Peña updated PARQUET-404:
--------------------------------
Summary: Add a note to dev/README.md that mentions that an SSH public key on
the committer github account is needed  (was: Add a note to dev/README.md that
mentions that a ssh public key on persona github account is needed)

> Add a note to dev/README.md that mentions that an SSH public key on the
> committer github account is needed
> ------------------------------------------------------------------------
>
> Key: PARQUET-404
> URL: https://issues.apache.org/jira/browse/PARQUET-404
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Sergio Peña
>Assignee: Sergio Peña
>Priority: Trivial
>
> When trying to merge a PR following the notes in dev/README.md (using
> dev/merge_parquet_pr.py) without a public key attached to the committer
> github account, a permission error is displayed when attempting to fetch
> from {{g...@github.com:apache/parquet-mr.git}}.
> We should add a note to dev/README.md that mentions that an SSH public key on
> the committer github account is needed in order to fetch code from
> {{apache-github}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-393) release parquet-format 2.3.1

2015-12-14 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056385#comment-15056385
 ] 

Ryan Blue commented on PARQUET-393:
-----------------------------------

Vote thread started. Please vote!

> release parquet-format 2.3.1
> ----------------------------
>
> Key: PARQUET-393
> URL: https://issues.apache.org/jira/browse/PARQUET-393
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Ryan Blue
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Error due to null Counter

2015-12-14 Thread Stephen Bly
Greetings Parquet developers. I am trying to create my own custom InputFormat 
for reading Parquet tables in Hive. This is how I create the table:

CREATE EXTERNAL TABLE api_hit_parquet_test
ROW FORMAT SERDE 'com.foursquare.hadoop.hive.serde.RecordV2SerDe'
WITH SERDEPROPERTIES ('serialization.class' = 'com.foursquare.logs.gen.ApiHit')
STORED AS
  INPUTFORMAT 'com.foursquare.hadoop.hive.io.HiveThriftParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/bly/api_hit_parquet'
TBLPROPERTIES ('thrift.parquetfile.input.format.thrift.class' = 'com.foursquare.logs.gen.ApiHit')

The table is successfully created, and I can verify the schema is correct by 
running DESCRIBE FORMATTED on it. However, when I try to do a simple SELECT * 
on the table, I get the following stack trace:

java.io.IOException: java.lang.RuntimeException: Could not read first record 
(and it was not an EOF)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1657)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
at 
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:756)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
Caused by: java.lang.RuntimeException: Could not read first record (and it was 
not an EOF)
at 
com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.initKeyValueObjects(DeprecatedInputFormatWrapper.java:280)
at 
com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.createValue(DeprecatedInputFormatWrapper.java:297)
at 
com.foursquare.hadoop.hive.io.HiveThriftParquetInputFormat$$anon$1.<init>(HiveThriftParquetInputFormat.scala:47)
at 
com.foursquare.hadoop.hive.io.HiveThriftParquetInputFormat.getRecordReader(HiveThriftParquetInputFormat.scala:46)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:667)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:323)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:445)
... 9 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
at 0 in block -1 in file 
hdfs://hadoop-alidoro-nn-vip/user/bly/api_hit_parquet/part-m-0.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at 
com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.initKeyValueObjects(DeprecatedInputFormatWrapper.java:271)
... 15 more
Caused by: java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.parquet.hadoop.util.ContextUtil.invoke(ContextUtil.java:264)
at 
org.apache.parquet.hadoop.util.ContextUtil.incrementCounter(ContextUtil.java:273)
at 
org.apache.parquet.hadoop.util.counters.mapreduce.MapReduceCounterAdapter.increment(MapReduceCounterAdapter.java:38)
at 
org.apache.parquet.hadoop.util.counters.BenchmarkCounter.incrementTotalBytes(BenchmarkCounter.java:78)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:497)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
... 17 more

I have spent some time following this stack trace, and it appears that the
error lies in the Counter code, which is odd because I don't do anything with
that. Is there some way I need to initialize counters?

To be specific, I have found that MapReduceCounterAdapter is being created with 
a null parameter. Here is the constructor:

public MapReduceCounterAdapter(Counter adaptee) {
    // adaptee is the underlying Hadoop Counter; note that nothing here
    // guards against a null value.
    this.adaptee = adaptee;
}

So adaptee is being passed in as null and then dereferenced later on, causing
my NullPointerException.

The adaptee parameter is created by this 
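
For reference, a hypothetical null-safe variant of the adapter's increment
path (an illustration of a possible guard, not the actual parquet-mr fix):

public void increment(long val) {
    // Hypothetical guard, assuming the adapter may be handed a null Counter:
    if (adaptee != null) {   // no-op when the framework supplied no counter
        adaptee.increment(val);
    }
}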

[VOTE] Release Apache Parquet Format 2.3.1 RC1

2015-12-14 Thread Ryan Blue

Hi everyone,

I propose the following RC to be released as the official Apache Parquet
Format 2.3.1 release.


The commit id is f89d589fe3f5ec62f16fb101ba0213d097cf2021
* This corresponds to the tag: apache-parquet-format-2.3.1
* https://github.com/apache/parquet-format/tree/f89d589f
* 
https://git-wip-us.apache.org/repos/asf?p=parquet-format.git;a=commit;h=f89d589f


The release tarball, signature, and checksums are here:
* 
https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.3.1-rc1/


You can find the KEYS file here:
* https://dist.apache.org/repos/dist/dev/parquet/KEYS

Binary artifacts are staged in Nexus here:
* 
https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.3.1/


This release is the first release of parquet-format after incubation and 
is mostly updates to the spec. The only code change is that slf4j-nop is 
now shaded, which removes a warning message.


Please download, verify, and test.

Please vote by about 10 AM PST on 17 December.

[ ] +1 Release this as Apache Parquet Format 2.3.1
[ ] +0
[ ] -1 Do not release this because...


--
Ryan Blue


[jira] [Resolved] (PARQUET-403) Remove incubating from parquet-format release process

2015-12-14 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-403.
-------------------------------
   Resolution: Fixed
     Assignee: Ryan Blue
Fix Version/s: format-2.3.1

Merged #34. Thanks for reviewing, Julien!

> Remove incubating from parquet-format release process
> -----------------------------------------------------
>
> Key: PARQUET-403
> URL: https://issues.apache.org/jira/browse/PARQUET-403
> Project: Parquet
>  Issue Type: Bug
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: format-2.3.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Java 7

2015-12-14 Thread Ryan Blue

+1 for moving to Java 7 now.

For Java 8, I think Dec 2016 sounds reasonable given the end of public 
updates in April. But I think we should make that a goal and not a 
commitment. Java 8 code can't be run on a Java 7 VM [1] and it may take 
people a while to migrate even after free support runs out.


I think it makes sense to see if we know about anyone running Java 7 mid 
next year.


rb


[1]: 
https://stackoverflow.com/questions/16143684/can-java-8-code-be-compiled-to-run-on-java-7-jvm


On 12/12/2015 03:47 PM, Julien Le Dem wrote:

As a library, Parquet has to move to new versions of Java later than
applications do, to minimize pain for its users.
However, it looks like it is high time to move to Java 7:
http://www.oracle.com/technetwork/java/eol-135779.html

And we should probably define a deadline for Java 7 support before moving
to Java 8.

2 questions:
  - any objection to moving Parquet to Java 7?
  - let's define a date for moving to Java 8 so that users have time to plan
ahead or argue against. I propose Dec 2016. Any other proposal?

Related PR:
https://github.com/apache/parquet-mr/pull/231




--
Ryan Blue
Software Engineer
Cloudera, Inc.


Re: Parquet sync up

2015-12-14 Thread Ryan Blue

Works for me.

On 12/12/2015 03:20 PM, Julien Le Dem wrote:

The next parquet sync up is scheduled for next week Wednesday at 10 am PT.
Any objection to moving it to Thursday at the same time? I have a conflict.




--
Ryan Blue
Software Engineer
Cloudera, Inc.


Re: Parquet sync up

2015-12-14 Thread Reuben Kuhnert
I can make that. Thanks

On Mon, Dec 14, 2015 at 12:27 PM, Ryan Blue  wrote:

> Works for me.
>
>
> On 12/12/2015 03:20 PM, Julien Le Dem wrote:
>
>> The next parquet sync up is scheduled for next week Wednesday at 10 am PT.
>> Any objection to moving it to Thursday at the same time? I have a conflict.
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>


Re: Error due to null Counter

2015-12-14 Thread Stephen Bly
Thanks so much for looking into this! I'm pretty sure the issue is on my end
and not in the Parquet/Hive code (I'm rather inexperienced in the world of Big
Data, and Hadoop in particular). But the error message is a little obscure, so
I can't figure out what I'm doing wrong in order to fix it.

Let me know if you need to see any more of my code to help you investigate this.

Re: Error due to null Counter

2015-12-14 Thread Reuben Kuhnert
Hi Stephen,

I created ticket https://issues.apache.org/jira/browse/PARQUET-406 to
track your issue. We'll take a look and get back to you.

Thanks, and let us know if you have any other questions.
Reuben

On Mon, Dec 14, 2015 at 12:22 PM, Stephen Bly  wrote:

> Greetings Parquet developers. I am trying to create my own custom
> InputFormat for reading Parquet tables in Hive. This is how I create the
> table:
>
> CREATE EXTERNAL TABLE api_hit_parquet_test ROW FORMAT SERDE
> 'com.foursquare.hadoop.hive.serde.RecordV2SerDe' WITH SERDEPROPERTIES
> ('serialization.class' = 'com.foursquare.logs.gen.ApiHit') STORED AS
> INPUTFORMAT 'com.foursquare.hadoop.hive.io.HiveThriftParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '/user/bly/api_hit_parquet' TBLPROPERTIES
> ('thrift.parquetfile.input.format.thrift.class' =
> 'com.foursquare.logs.gen.ApiHit')
>
> The table is successfully created, and I can verify the schema is correct
> by running DESCRIBE FORMATTED on it. However, when I try to do a simple
> SELECT * on the table, I get the following stack trace:
>
> java.io.IOException: java.lang.RuntimeException: Could not read first
> record (and it was not an EOF)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
> at
> org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1657)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
> at
> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:756)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
> Caused by: java.lang.RuntimeException: Could not read first record (and it
> was not an EOF)
> at
> com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.initKeyValueObjects(DeprecatedInputFormatWrapper.java:280)
> at
> com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.createValue(DeprecatedInputFormatWrapper.java:297)
> at
> com.foursquare.hadoop.hive.io.HiveThriftParquetInputFormat$$anon$1.<init>(HiveThriftParquetInputFormat.scala:47)
> at
> com.foursquare.hadoop.hive.io.HiveThriftParquetInputFormat.getRecordReader(HiveThriftParquetInputFormat.scala:46)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:667)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:323)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:445)
> ... 9 more
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 0 in block -1 in file
> hdfs://hadoop-alidoro-nn-vip/user/bly/api_hit_parquet/part-m-0.parquet
> at
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
> at
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
> at
> com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.initKeyValueObjects(DeprecatedInputFormatWrapper.java:271)
> ... 15 more
> Caused by: java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.parquet.hadoop.util.ContextUtil.invoke(ContextUtil.java:264)
> at
> org.apache.parquet.hadoop.util.ContextUtil.incrementCounter(ContextUtil.java:273)
> at
> org.apache.parquet.hadoop.util.counters.mapreduce.MapReduceCounterAdapter.increment(MapReduceCounterAdapter.java:38)
> at
> org.apache.parquet.hadoop.util.counters.BenchmarkCounter.incrementTotalBytes(BenchmarkCounter.java:78)
> at
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:497)
> at
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)
> at
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
> ... 17 more
>
> I have spent some time following this stack trace, and it appears that the
> error lies in the Counter code, which is