[jira] [Created] (HIVE-14306) Hive Failed to read Parquet Files generated by SparkSQL

2016-07-21 Thread Teng Yutong (JIRA)
Teng Yutong created HIVE-14306:
--

 Summary: Hive Failed to read Parquet Files generated by SparkSQL
 Key: HIVE-14306
 URL: https://issues.apache.org/jira/browse/HIVE-14306
 Project: Hive
  Issue Type: Bug
  Components: CLI
Affects Versions: 1.2.1
Reporter: Teng Yutong


I'm trying to implement the following process:

1. create a Hive Parquet table A using the Hive CLI
2. create an external table B with the same schema as A, pointing to an existing 
folder in HDFS that contains one CSV file
3. execute `insert into A select * from B` using SparkSQL
4. query table A.
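
For concreteness, the four steps can be written out as statements. The sketch below is only an illustration in Python: the column list (`id`, `name`), the CSV delimiter, and the HDFS location `/tmp/b_csv` are hypothetical placeholders, not the actual schema from this report. Only the table names A and B and the `insert into A select * from B` statement come from the steps above.

```python
# Hypothetical schema and location; the real table definition is not given in this report.
COLUMNS = "id INT, name STRING"
CSV_LOCATION = "/tmp/b_csv"


def repro_statements(columns=COLUMNS, location=CSV_LOCATION):
    """Return (engine, statement) pairs for the four steps described above."""
    return [
        # Step 1: Parquet-backed table A, created from the Hive CLI.
        ("hive", f"CREATE TABLE A ({columns}) STORED AS PARQUET"),
        # Step 2: external table B over an existing folder holding one CSV file.
        ("hive",
         f"CREATE EXTERNAL TABLE B ({columns}) "
         "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
         f"LOCATION '{location}'"),
        # Step 3: the insert is run through SparkSQL rather than Hive.
        ("sparksql", "insert into A select * from B"),
        # Step 4: reading A back from the Hive CLI is where the exception appears.
        ("hive", "SELECT * FROM A"),
    ]


for engine, statement in repro_statements():
    print(f"[{engine}] {statement}")
```

Keeping the engine next to each statement makes it explicit that only step 3 goes through SparkSQL, which is the one variable that changes between the failing and the working case.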

Weird things happen in steps 3 and 4.

If the `insert into` statement is executed by SparkSQL, the Hive CLI throws an 
exception when querying table A:
```
Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: 
Can not read value at 0 in block -1 in file 
hdfs://NEOInciteDataNode-1:8020/user/hive/warehouse/call_center/part-r-0-b9b6962d-cbab-452b-835b-c10c6221b8fa.gz.parquet
```

But SparkSQL can query table A without trouble...

If the `insert` statement is executed by the Hive CLI, querying table A in the 
Hive CLI works fine.

So am I doing something wrong, or is this just a bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat

2014-07-03 Thread Teng Yutong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Yutong updated HIVE-6584:
--

Attachment: HIVE-6584.7.patch

Fixed some bugs, but changes are still needed on the HBase side.

 Add HiveHBaseTableSnapshotInputFormat
 -

 Key: HIVE-6584
 URL: https://issues.apache.org/jira/browse/HIVE-6584
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
 Fix For: 0.14.0

 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, HIVE-6584.2.patch, 
 HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch, HIVE-6584.6.patch, 
 HIVE-6584.7.patch


 HBASE-8369 provided mapreduce support for reading from HBase table snapshots. 
 This allows a MR job to consume a stable, read-only view of an HBase table 
 directly off of HDFS. Bypassing the online region server API provides a nice 
 performance boost for the full scan. HBASE-10642 is backporting that feature 
 to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's 
 available, we should add an input format. A follow-on patch could work out 
 how to integrate this functionality into the StorageHandler, similar to how 
 HIVE-6473 integrates the HFileOutputFormat into existing table definitions.





[jira] [Updated] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat

2014-06-27 Thread Teng Yutong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Yutong updated HIVE-6584:
--

Attachment: HIVE-6584.6.patch

hi,

Sorry for the late reply. This is the regenerated patch, but it won't work 
unless HBase is modified, because we need HBase to expose 
TableSnapshotRegionSplit and convertStringToScan.

BR

 Add HiveHBaseTableSnapshotInputFormat
 -

 Key: HIVE-6584
 URL: https://issues.apache.org/jira/browse/HIVE-6584
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
 Fix For: 0.14.0

 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, HIVE-6584.2.patch, 
 HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch, HIVE-6584.6.patch




[jira] [Updated] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat

2014-06-18 Thread Teng Yutong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Yutong updated HIVE-6584:
--

Attachment: HIVE-6584.5.patch

hi,

This patch is my current workaround for dealing with HBase snapshots.

To make it work, some changes are still needed on the HBase side: the access 
modifiers of mapreduce.TableMapReduceUtil.convertStringToScan and 
mapreduce.TableSnapshotInputFormat.TableSnapshotRegionSplit must be changed to 
public. Since there is no related issue in the HBase JIRA, I haven't created a 
patch for these changes.


 Add HiveHBaseTableSnapshotInputFormat
 -

 Key: HIVE-6584
 URL: https://issues.apache.org/jira/browse/HIVE-6584
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
 Fix For: 0.14.0

 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, HIVE-6584.2.patch, 
 HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch




[jira] [Commented] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat

2014-06-11 Thread Teng Yutong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028753#comment-14028753
 ] 

Teng Yutong commented on HIVE-6584:
---

Hi Nick,

I have some concerns about these patches:
1. HBaseStorageHandler.getInputFormatClass(): I am afraid the returned 
input format will always be HiveHBaseTableInputFormat (at least according to my 
tests).
2. In HBaseStorageHandler.preCreateTable, Hive checks whether the HBase table 
exists, regardless of whether the external table being created is based on an 
actual table or a snapshot.
3. The TableSnapshotRegionSplit used in TableSnapshotInputFormat is a direct 
subclass of InputSplit, not a subclass of TableSplit.
4. There is no public setScan method in TableSnapshotInputFormat.RecordReader; 
instead, it translates a string into a Scan instance using 
mapreduce.TableMapReduceUtil.convertStringToScan.

So I suggest adding a subclass of HBaseStorageHandler (and other necessary 
classes), say HBaseSnapshotStorageHandler, to deal with the HBase snapshot 
situation.
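
Points 1 to 3 and the suggested subclass can be modelled in a few lines. This is a simplified Python model, not Hive or HBase code: the classes below are stand-ins that only mirror the names and inheritance relationships described in this comment, the snake_case method names are Python renderings of `getInputFormatClass` and `preCreateTable`, and `HiveHBaseTableSnapshotInputFormat` is taken from the issue title as the assumed return value.

```python
class InputSplit:
    """Stand-in for the Hadoop InputSplit base class."""


class TableSplit(InputSplit):
    """Stand-in for the split type used for live HBase tables."""


class TableSnapshotRegionSplit(InputSplit):
    """Stand-in: extends InputSplit directly, not TableSplit (point 3)."""


class HBaseStorageHandler:
    """Simplified model of the existing handler's behaviour (points 1 and 2)."""

    def get_input_format_class(self):
        return "HiveHBaseTableInputFormat"

    def pre_create_table(self, hbase_table_exists):
        # Models the existence check on the live HBase table.
        if not hbase_table_exists:
            raise ValueError("HBase table does not exist")


class HBaseSnapshotStorageHandler(HBaseStorageHandler):
    """Hypothetical subclass for snapshot-backed tables, as suggested above."""

    def get_input_format_class(self):
        return "HiveHBaseTableSnapshotInputFormat"

    def pre_create_table(self, hbase_table_exists):
        # A snapshot needs no live table, so the existence check is skipped.
        return None


def as_table_split(split):
    """Code written against TableSplit rejects a snapshot split."""
    if not isinstance(split, TableSplit):
        raise TypeError(f"{type(split).__name__} is not a TableSplit")
    return split
```

In this model, `as_table_split(TableSnapshotRegionSplit())` raises a `TypeError`, which is the kind of mismatch that a separate handler and input format for snapshots would avoid.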

In fact, I have already finished the necessary code changes and done some 
tests. The tests show that my modification works out.

I will upload my patch soon.

 Add HiveHBaseTableSnapshotInputFormat
 -

 Key: HIVE-6584
 URL: https://issues.apache.org/jira/browse/HIVE-6584
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
 Fix For: 0.14.0

 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, HIVE-6584.2.patch, 
 HIVE-6584.3.patch




[jira] [Updated] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat

2014-05-21 Thread Teng Yutong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Yutong updated HIVE-6584:
--

Attachment: HIVE-6584.1.patch

This patch is based on the newest patch (HBASE-11137.02-0.98.patch) for 
HBASE-11137.

 Add HiveHBaseTableSnapshotInputFormat
 -

 Key: HIVE-6584
 URL: https://issues.apache.org/jira/browse/HIVE-6584
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch

