[jira] [Created] (HIVE-14306) Hive Failed to read Parquet Files generated by SparkSQL
Teng Yutong created HIVE-14306: -- Summary: Hive Failed to read Parquet Files generated by SparkSQL Key: HIVE-14306 URL: https://issues.apache.org/jira/browse/HIVE-14306 Project: Hive Issue Type: Bug Components: CLI Affects Versions: 1.2.1 Reporter: Teng Yutong I'm trying to implement the following process: 1. create a hive parquet table A use hive CLI 2. create a external table B whose schema just like A, but point to a exist folder which contains one csv file in HDSF 3. execute `insert into A select * from B` using SparkSQL 4. query table A. wired thing happens in step 3 and 4。 If the 'insert into' statement executed by SparkSQL,Hive CLI would throw me an Exception when querying table A ``` Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://NEOInciteDataNode-1:8020/user/hive/warehouse/call_center/part-r-0-b9b6962d-cbab-452b-835b-c10c6221b8fa.gz.parquet ``` But SparkSQL can query table A without trouble... If the `insert` statement executed by Hive CLI, query table A in Hive CLI would be just fine... So am I doing something wrong, or this is just a bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat
[ https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Yutong updated HIVE-6584: -- Attachment: HIVE-6584.7.patch fix some bug..but still need changes on HBase side Add HiveHBaseTableSnapshotInputFormat - Key: HIVE-6584 URL: https://issues.apache.org/jira/browse/HIVE-6584 Project: Hive Issue Type: Improvement Components: HBase Handler Reporter: Nick Dimiduk Assignee: Nick Dimiduk Fix For: 0.14.0 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, HIVE-6584.2.patch, HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch, HIVE-6584.6.patch, HIVE-6584.7.patch HBASE-8369 provided mapreduce support for reading from HBase table snapsopts. This allows a MR job to consume a stable, read-only view of an HBase table directly off of HDFS. Bypassing the online region server API provides a nice performance boost for the full scan. HBASE-10642 is backporting that feature to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's available, we should add an input format. A follow-on patch could work out how to integrate this functionality into the StorageHandler, similar to how HIVE-6473 integrates the HFileOutputFormat into existing table definitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat
[ https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Yutong updated HIVE-6584: -- Attachment: HIVE-6584.6.patch hi, sorry for the late reply...this is the regenerated patch. But It won't work unless HBase has been modified. Because we need HBase expose TableSnapshotRegionSplit and convertStringToScan. BR Add HiveHBaseTableSnapshotInputFormat - Key: HIVE-6584 URL: https://issues.apache.org/jira/browse/HIVE-6584 Project: Hive Issue Type: Improvement Components: HBase Handler Reporter: Nick Dimiduk Assignee: Nick Dimiduk Fix For: 0.14.0 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, HIVE-6584.2.patch, HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch, HIVE-6584.6.patch HBASE-8369 provided mapreduce support for reading from HBase table snapsopts. This allows a MR job to consume a stable, read-only view of an HBase table directly off of HDFS. Bypassing the online region server API provides a nice performance boost for the full scan. HBASE-10642 is backporting that feature to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's available, we should add an input format. A follow-on patch could work out how to integrate this functionality into the StorageHandler, similar to how HIVE-6473 integrates the HFileOutputFormat into existing table definitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat
[ https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Yutong updated HIVE-6584: -- Attachment: HIVE-6584.5.patch hi, this patch is my current workaround when dealing with HBase snapshot. but in order to make this patch work, still some changes is needed on the HBase side (change the visible descriptor of mapreduce.TableMapReduceUitls.convertStringToScan and mapreduce.TableSnapshotInputFormat.TableSnapshotRegionSplit into public). Since there is no issue related to this in HBase JIRA, so i haven't create a patch for these changes. Add HiveHBaseTableSnapshotInputFormat - Key: HIVE-6584 URL: https://issues.apache.org/jira/browse/HIVE-6584 Project: Hive Issue Type: Improvement Components: HBase Handler Reporter: Nick Dimiduk Assignee: Nick Dimiduk Fix For: 0.14.0 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, HIVE-6584.2.patch, HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch HBASE-8369 provided mapreduce support for reading from HBase table snapsopts. This allows a MR job to consume a stable, read-only view of an HBase table directly off of HDFS. Bypassing the online region server API provides a nice performance boost for the full scan. HBASE-10642 is backporting that feature to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's available, we should add an input format. A follow-on patch could work out how to integrate this functionality into the StorageHandler, similar to how HIVE-6473 integrates the HFileOutputFormat into existing table definitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat
[ https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14028753#comment-14028753 ] Teng Yutong commented on HIVE-6584: --- hi nick, i have some concerns about these patches: 1. HBaseStorageHandler.getInputFormatClass(): i am afraid that the returned inputformat will always be HiveHBaseTabelInputFormat (at least according to my test) 2. in the method HBaseStorageHandler.preCreateTable, hive will check whether the HBase table exist or not, regardless the external table that hive gonna create is based on actual table or a snapshot. 3. the TableSnapshotRegionSplit used in TableSnapshotInputFormat is a direct subclass of InputSplit, not a subclass of tablesplit 4. there is no public setScan method in TableSnapshotInputFormat.RecordReader, instead it will translate a string into a scan instance by using mapreduce.TableMapReduceUitls.convertStringToScan. So I suggest adding a subclass of HBaseStorageHandler(and other necessary classes) ,say HBaseSnapshotStorageHandler, to deal with the hbase snapshot situation. In fact, I have already finished the necessary code changes and done some tests. The tests show that my modification works out. i will upload my patch soon Add HiveHBaseTableSnapshotInputFormat - Key: HIVE-6584 URL: https://issues.apache.org/jira/browse/HIVE-6584 Project: Hive Issue Type: Improvement Components: HBase Handler Reporter: Nick Dimiduk Assignee: Nick Dimiduk Fix For: 0.14.0 Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, HIVE-6584.2.patch, HIVE-6584.3.patch HBASE-8369 provided mapreduce support for reading from HBase table snapsopts. This allows a MR job to consume a stable, read-only view of an HBase table directly off of HDFS. Bypassing the online region server API provides a nice performance boost for the full scan. HBASE-10642 is backporting that feature to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's available, we should add an input format. A follow-on patch could work out how to integrate this functionality into the StorageHandler, similar to how HIVE-6473 integrates the HFileOutputFormat into existing table definitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat
[ https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Yutong updated HIVE-6584: -- Attachment: HIVE-6584.1.patch this patch is based on the newest patch(HBASE-11137.02-0.98.patch) related HBASE-11137. Add HiveHBaseTableSnapshotInputFormat - Key: HIVE-6584 URL: https://issues.apache.org/jira/browse/HIVE-6584 Project: Hive Issue Type: Improvement Components: HBase Handler Reporter: Nick Dimiduk Assignee: Nick Dimiduk Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch HBASE-8369 provided mapreduce support for reading from HBase table snapsopts. This allows a MR job to consume a stable, read-only view of an HBase table directly off of HDFS. Bypassing the online region server API provides a nice performance boost for the full scan. HBASE-10642 is backporting that feature to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's available, we should add an input format. A follow-on patch could work out how to integrate this functionality into the StorageHandler, similar to how HIVE-6473 integrates the HFileOutputFormat into existing table definitions. -- This message was sent by Atlassian JIRA (v6.2#6252)