[jira] [Created] (HIVE-4250) Closing lots of RecordWriters is slow
Owen O'Malley created HIVE-4250: --- Summary: Closing lots of RecordWriters is slow Key: HIVE-4250 URL: https://issues.apache.org/jira/browse/HIVE-4250 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley In FileSinkOperator, all of the RecordWriters are closed sequentially. For queries with a lot of dynamic partitions this can add substantially to the task time. For one query in particular, after processing all of the records in a few minutes, the reduce tasks spend 15 minutes closing all of the RC files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
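The direction the issue points at can be sketched as follows — a hypothetical illustration (the `Writer` interface and class names are invented, not FileSinkOperator's actual code): close the writers on a thread pool instead of one at a time, so the close latency of many files overlaps.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: close many writers concurrently instead of
// sequentially. "Writer" stands in for Hive's RecordWriter.
class ParallelClose {
  interface Writer {
    void close() throws IOException;
  }

  static void closeAll(List<Writer> writers, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<?>> pending = new ArrayList<>();
    for (Writer w : writers) {
      // submit as a Callable so close()'s IOException propagates
      pending.add(pool.submit(() -> { w.close(); return null; }));
    }
    pool.shutdown();
    for (Future<?> f : pending) {
      f.get();  // rethrows any close() failure wrapped in ExecutionException
    }
  }
}
```

With thousands of dynamic-partition files, total close time drops from the sum of per-file latencies toward the maximum across the pool.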
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616691#comment-13616691 ] Owen O'Malley commented on HIVE-4248: - This may result in ORC files with smaller stripes, but that seems far better than letting the users get out of memory exceptions. > Implement a memory manager for ORC > -- > > Key: HIVE-4248 > URL: https://issues.apache.org/jira/browse/HIVE-4248 > Project: Hive > Issue Type: New Feature > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > > With the large default stripe size (256MB) and dynamic partitions, it is > quite easy for users to run out of memory when writing ORC files. We probably > need a solution that keeps track of the total number of concurrent ORC > writers and divides the available heap space between them.
[jira] [Created] (HIVE-4248) Implement a memory manager for ORC
Owen O'Malley created HIVE-4248: --- Summary: Implement a memory manager for ORC Key: HIVE-4248 URL: https://issues.apache.org/jira/browse/HIVE-4248 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley With the large default stripe size (256MB) and dynamic partitions, it is quite easy for users to run out of memory when writing ORC files. We probably need a solution that keeps track of the total number of concurrent ORC writers and divides the available heap space between them.
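The proposed policy — track concurrent writers and divide a fixed pool of heap between them — can be sketched as below. This is an invented illustration, not Hive's eventual MemoryManager API; the smaller-stripes trade-off mentioned in the comment falls out of the `min`.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a shared pool that shrinks each writer's stripe
// budget as more ORC writers register, so total buffering stays bounded.
class OrcMemoryPool {
  private final long poolBytes;                 // heap reserved for ORC writers
  private final Set<Object> writers = new HashSet<>();

  OrcMemoryPool(long poolBytes) { this.poolBytes = poolBytes; }

  synchronized void addWriter(Object writer)    { writers.add(writer); }
  synchronized void removeWriter(Object writer) { writers.remove(writer); }

  // Effective stripe budget: the configured stripe size, reduced to an
  // equal share of the pool when many writers are open at once.
  synchronized long stripeBudget(long configuredStripeSize) {
    int n = Math.max(1, writers.size());
    return Math.min(configuredStripeSize, poolBytes / n);
  }
}
```

For example, with a 512MB pool and the default 256MB stripe, a single writer keeps its full stripe size, while four concurrent dynamic-partition writers each get 128MB.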
[jira] [Commented] (HIVE-4227) Add column level encryption to ORC files
[ https://issues.apache.org/jira/browse/HIVE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616678#comment-13616678 ] Owen O'Malley commented on HIVE-4227: - Andrew, Yes if the code is available and provides the right API. > Add column level encryption to ORC files > > > Key: HIVE-4227 > URL: https://issues.apache.org/jira/browse/HIVE-4227 > Project: Hive > Issue Type: New Feature > Reporter: Owen O'Malley > Labels: gsoc, gsoc2013 > > It would be useful to support column level encryption in ORC files. Since > each column and its associated index is stored separately, encrypting a > column separately isn't difficult. In terms of key distribution, it would > make sense to use an external server like the one in HADOOP-9331.
[jira] [Commented] (HIVE-4244) Make string dictionaries adaptive in ORC
[ https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616657#comment-13616657 ] Owen O'Malley commented on HIVE-4244: - We should play with different values, but I was guessing the right cutover point for the heuristic was at a loading of 2 to 3 (50% to 33% distinct values). We aren't really going to know whether the heuristic is right or wrong unless we compare both encodings, which is much too expensive. By taking a good guess after looking at the start of the stripe, we can get good performance most of the time. > Make string dictionaries adaptive in ORC > > > Key: HIVE-4244 > URL: https://issues.apache.org/jira/browse/HIVE-4244 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Kevin Wilfong > > The ORC writer should adaptively switch between dictionary and direct > encoding. I'd propose looking at the first 100,000 values in each column and > decide whether there is sufficient loading in the dictionary to use > dictionary encoding.
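The cutover heuristic discussed above can be sketched as follows — an invented illustration, not the ORC writer's actual code. "Loading" is the sample size divided by the number of distinct values, so a loading of 2 means each distinct value appears twice on average:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: look at the first values of a column (the issue
// proposes 100,000) and choose dictionary encoding only when the dictionary
// "loading" (values per distinct value) clears a cutover threshold.
class DictionaryHeuristic {
  static boolean useDictionary(List<String> sample, double minLoading) {
    if (sample.isEmpty()) {
      return false;
    }
    Set<String> distinct = new HashSet<>(sample);
    double loading = (double) sample.size() / distinct.size();
    return loading >= minLoading;
  }
}
```

With `minLoading` between 2 and 3 this matches the 50%-to-33%-distinct cutover guessed in the comment; the point of sampling only the start of the stripe is that comparing both encodings outright would be far too expensive.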
[jira] [Commented] (HIVE-4245) Implement numeric dictionaries in ORC
[ https://issues.apache.org/jira/browse/HIVE-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616613#comment-13616613 ] Owen O'Malley commented on HIVE-4245: - If you look at the original ORC github, you can see a float and double redblack tree that I pulled out in getting it ready for the initial push into Apache. https://github.com/hortonworks/orc/tree/9cdb2e88d377c801655fbb9015938ea3a93e12ca/src/main/java/org/apache/hadoop/hive/ql/io/orc > Implement numeric dictionaries in ORC > - > > Key: HIVE-4245 > URL: https://issues.apache.org/jira/browse/HIVE-4245 > Project: Hive > Issue Type: New Feature > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Pamela Vagata > > For many applications, especially in de-normalized data, there is a lot of > redundancy in the numeric columns. Therefore, it would make sense to > adaptively use dictionary encodings for numeric columns in addition to string > columns.
[jira] [Created] (HIVE-4246) Implement predicate pushdown for ORC
Owen O'Malley created HIVE-4246: --- Summary: Implement predicate pushdown for ORC Key: HIVE-4246 URL: https://issues.apache.org/jira/browse/HIVE-4246 Project: Hive Issue Type: New Feature Reporter: Owen O'Malley Assignee: Owen O'Malley By using the push down predicates from the table scan operator, ORC can skip over 10,000 rows at a time that won't satisfy the predicate. This will help a lot, especially if the file is sorted by the column that is used in the predicate. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
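The skipping mechanism can be sketched like this — an invented illustration, not ORC's actual index structures: with min/max statistics kept per group of 10,000 rows, a predicate such as `col = v` lets the reader skip any group whose [min, max] range cannot contain `v`.

```java
// Hypothetical sketch of index-based row skipping. A reader consults
// per-group min/max statistics and only decompresses groups that might
// contain matching rows; a file sorted on the predicate column produces
// tight, mostly non-overlapping ranges, so far more groups are skipped.
class RowGroupSkip {
  static final int ROWS_PER_GROUP = 10_000;  // granularity from the issue

  static class GroupStats {
    final long min;
    final long max;
    GroupStats(long min, long max) { this.min = min; this.max = max; }
  }

  // True if the group MAY contain a row equal to target (never a false
  // negative; false positives just cost a group read).
  static boolean mightMatchEquals(GroupStats g, long target) {
    return target >= g.min && target <= g.max;
  }
}
```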
[jira] [Resolved] (HIVE-4121) ORC should have optional dictionaries for both strings and numeric types
[ https://issues.apache.org/jira/browse/HIVE-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved HIVE-4121. - Resolution: Duplicate I forgot I had filed this and filed the split-apart issues as HIVE-4244 and HIVE-4245. > ORC should have optional dictionaries for both strings and numeric types > > > Key: HIVE-4121 > URL: https://issues.apache.org/jira/browse/HIVE-4121 > Project: Hive > Issue Type: New Feature > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > > Currently string columns always have dictionaries and numerics are always > directly encoded. It would be better to make the encoding depend on a sample > of the data. Perhaps the first 100k values should be evaluated for repeated > values and the encoding picked for the stripe.
[jira] [Created] (HIVE-4245) Implement numeric dictionaries in ORC
Owen O'Malley created HIVE-4245: --- Summary: Implement numeric dictionaries in ORC Key: HIVE-4245 URL: https://issues.apache.org/jira/browse/HIVE-4245 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley For many applications, especially in de-normalized data, there is a lot of redundancy in the numeric columns. Therefore, it would make sense to adaptively use dictionary encodings for numeric columns in addition to string columns.
[jira] [Created] (HIVE-4244) Make string dictionaries adaptive in ORC
Owen O'Malley created HIVE-4244: --- Summary: Make string dictionaries adaptive in ORC Key: HIVE-4244 URL: https://issues.apache.org/jira/browse/HIVE-4244 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley The ORC writer should adaptively switch between dictionary and direct encoding. I'd propose looking at the first 100,000 values in each column and decide whether there is sufficient loading in the dictionary to use dictionary encoding.
[jira] [Resolved] (HIVE-2162) Upgrade dependencies to Hadoop 0.20.2 and 0.20.203.0
[ https://issues.apache.org/jira/browse/HIVE-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved HIVE-2162. - Resolution: Duplicate This has been fixed already. > Upgrade dependencies to Hadoop 0.20.2 and 0.20.203.0 > > > Key: HIVE-2162 > URL: https://issues.apache.org/jira/browse/HIVE-2162 > Project: Hive > Issue Type: Improvement > Reporter: Owen O'Malley > > Hadoop has released 0.20.203.0 and we should upgrade Hive's dependency to it.
[jira] [Created] (HIVE-4243) Fix column names in FileSinkOperator
Owen O'Malley created HIVE-4243: --- Summary: Fix column names in FileSinkOperator Key: HIVE-4243 URL: https://issues.apache.org/jira/browse/HIVE-4243 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Reporter: Owen O'Malley All of the ObjectInspectors given to SerDe's by FileSinkOperator have virtual column names. Since the files are part of tables, Hive knows the column names. For self-describing file formats like ORC, having the real column names will improve the understandability.
[jira] [Commented] (HIVE-4227) Add column level encryption to ORC files
[ https://issues.apache.org/jira/browse/HIVE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616329#comment-13616329 ] Owen O'Malley commented on HIVE-4227: - Supun, I've tagged this for Google Summer of Code. Take a look at: http://www.google-melange.com/gsoc/homepage/google/gsoc2013 > Add column level encryption to ORC files > > > Key: HIVE-4227 > URL: https://issues.apache.org/jira/browse/HIVE-4227 > Project: Hive > Issue Type: New Feature >Reporter: Owen O'Malley > Labels: gsoc, gsoc2013 > > It would be useful to support column level encryption in ORC files. Since > each column and its associated index is stored separately, encrypting a > column separately isn't difficult. In terms of key distribution, it would > make sense to use an external server like the one in HADOOP-9331.
[jira] [Created] (HIVE-4242) Predicate push down should also be provided to InputFormats
Owen O'Malley created HIVE-4242: --- Summary: Predicate push down should also be provided to InputFormats Key: HIVE-4242 URL: https://issues.apache.org/jira/browse/HIVE-4242 Project: Hive Issue Type: Bug Components: StorageHandler Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, the push down predicate is only provided to native tables if the hive.optimize.index.filter configuration variable is set. There is no reason to prevent InputFormats from getting the required information to do predicate push down. Obviously, this will be very useful for ORC.
[jira] [Created] (HIVE-4229) Create a hive-ql jar that doesn't include non-hive jars
Owen O'Malley created HIVE-4229: --- Summary: Create a hive-ql jar that doesn't include non-hive jars Key: HIVE-4229 URL: https://issues.apache.org/jira/browse/HIVE-4229 Project: Hive Issue Type: New Feature Reporter: Owen O'Malley We currently only ship the ql module as part of the hive-exec jar that includes other projects (thrift, avro, protobuf, commons lang, json, java-ewah, and javolution). This forces downstream users to get the upstream projects too.
Re: Question - why are there instances of org.apache.commons.lang.StringUtils and WordUtils bundled in hive?
You're right. I was thinking there was a hive-ql jar, but there isn't. (Note that they aren't duplicated in the source tree, just packaged up in the jar.) I've created https://issues.apache.org/jira/browse/HIVE-4229 to provide a jar of the ql classes without the upstream classes included. Note that in the long term, I think we need to simplify the jars, but that is a bigger issue. -- Owen On Mon, Mar 25, 2013 at 3:23 PM, Dave Winterbourne < dave.winterbou...@gmail.com> wrote: > We have a custom User Defined Function that extends UDF - I'll admit some > ignorance, as I inherited this code, but UDF is a class that comes from > hive-exec, so it doesn't seem true that hive-exec is not intended for > external usage. That having been said, my original question is why there > are classes from commons-lang that are simply duplicated in the code base. > This is bad form at best, but causes class collisions and thus duplicate > class warnings. > > On Mon, Mar 25, 2013 at 2:48 PM, Owen O'Malley wrote: > > > Hive-exec isn't meant for external usage. It is the bundled jar of Hive's > > runtime dependencies that are required for Hive's MapReduce tasks.
It > > consists of : > > > > hive-common > > hive-ql > > hive-serde > > hive-shims > > thrift > > commons-lang > > json > > avro > > avro-mapred > > java-ewah > > javolution > > protobuf-java > > > > -- Owen > > > > > > On Mon, Mar 25, 2013 at 11:42 AM, Dave Winterbourne < > > dave.winterbou...@gmail.com> wrote: > > > > > I have been working on eliminating duplicate class warnings in my maven > > > build, and in the end discovered that there are two classes from apache > > > commons-lang that are bundled with hive-exec: > > > > > > jar tf hive-0.10.0-bin//lib/hive-exec-0.10.0.jar | grep > > > org/apache/commons/lang/ > > > org/apache/commons/lang/ > > > org/apache/commons/lang/StringUtils.class > > > org/apache/commons/lang/WordUtils.class > > > > > > Why are these classes bundled with hive as opposed to just using > > > commons-lang? If there truly is a need for custom functionality, why > not > > > put it in a different class to avoid this collision? > > > > > >
Re: Question - why are there instances of org.apache.commons.lang.StringUtils and WordUtils bundled in hive?
Hive-exec isn't meant for external usage. It is the bundled jar of Hive's runtime dependencies that are required for Hive's MapReduce tasks. It consists of : hive-common hive-ql hive-serde hive-shims thrift commons-lang json avro avro-mapred java-ewah javolution protobuf-java -- Owen On Mon, Mar 25, 2013 at 11:42 AM, Dave Winterbourne < dave.winterbou...@gmail.com> wrote: > I have been working on eliminating duplicate class warnings in my maven > build, and in the end discovered that there are two classes from apache > commons-lang that are bundled with hive-exec: > > jar tf hive-0.10.0-bin//lib/hive-exec-0.10.0.jar | grep > org/apache/commons/lang/ > org/apache/commons/lang/ > org/apache/commons/lang/StringUtils.class > org/apache/commons/lang/WordUtils.class > > Why are these classes bundled with hive as opposed to just using > commons-lang? If there truly is a need for custom functionality, why not > put it in a different class to avoid this collision? >
[jira] [Created] (HIVE-4227) Add column level encryption to ORC files
Owen O'Malley created HIVE-4227: --- Summary: Add column level encryption to ORC files Key: HIVE-4227 URL: https://issues.apache.org/jira/browse/HIVE-4227 Project: Hive Issue Type: New Feature Reporter: Owen O'Malley It would be useful to support column level encryption in ORC files. Since each column and its associated index is stored separately, encrypting a column separately isn't difficult. In terms of key distribution, it would make sense to use an external server like the one in HADOOP-9331.
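A minimal sketch of the per-column idea — hypothetical code only, not ORC's eventual column encryption, and with key distribution (the HADOOP-9331 server) left out entirely. ECB mode is used here solely to keep the sketch short; a real implementation would need an authenticated cipher mode with per-stream IVs.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical sketch: since each ORC column's stream is stored separately,
// it can be run through a cipher with its own key before being written.
class ColumnCrypto {
  // Generate an AES key for one column; in the proposal, keys would come
  // from an external key server instead.
  static SecretKey newColumnKey() throws Exception {
    KeyGenerator gen = KeyGenerator.getInstance("AES");
    gen.init(128);
    return gen.generateKey();
  }

  // Encrypt or decrypt one column's serialized bytes.
  // NOTE: ECB is for brevity only and is not safe for real data.
  static byte[] transform(int mode, SecretKey key, byte[] data)
      throws Exception {
    Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
    cipher.init(mode, key);
    return cipher.doFinal(data);
  }
}
```

A reader lacking the column's key would simply be unable to decode that column, while the rest of the file stays readable.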
[jira] [Commented] (HIVE-4114) hive-metastore.jar depends on jdo2-api:jar:2.3-ec, which is missing in maven central
[ https://issues.apache.org/jira/browse/HIVE-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601460#comment-13601460 ] Owen O'Malley commented on HIVE-4114: - You'll also need to install the jdo2 jar in your maven repository:
{code}
# download jdo2-api-2.3-ec.jar to your working directory
mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api -Dversion=2.3-ec -Dpackaging=jar -Dfile=jdo2-api-2.3-ec.jar
{code}
The new jdo jar is available from http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/jdo2-api-2.3-ec.jar > hive-metastore.jar depends on jdo2-api:jar:2.3-ec, which is missing in maven > central > > > Key: HIVE-4114 > URL: https://issues.apache.org/jira/browse/HIVE-4114 > Project: Hive > Issue Type: Bug > Components: Build Infrastructure >Reporter: Gopal V >Priority: Trivial > > Adding hive-exec-0.10.0 to an independent pom.xml results in the following > error > {code} > Failed to retrieve javax.jdo:jdo2-api-2.3-ec > Caused by: Could not find artifact javax.jdo:jdo2-api:jar:2.3-ec in central > (http://repo1.maven.org/maven2) > ... > Path to dependency: > 1) org.notmysock.hive:plan-viewer:jar:1.0-SNAPSHOT > 2) org.apache.hive:hive-exec:jar:0.10.0 > 3) org.apache.hive:hive-metastore:jar:0.10.0 > 4) javax.jdo:jdo2-api:jar:2.3-ec > {code} > From the best I could tell, in the hive build ant+ivy pulls this file from > the datanucleus repo > http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/ > For completeness sake, the dependency needs to be pulled to maven central.
[jira] [Commented] (HIVE-4156) need to add protobuf classes to hive-exec.jar
[ https://issues.apache.org/jira/browse/HIVE-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601257#comment-13601257 ] Owen O'Malley commented on HIVE-4156: - No worries, but thanks for removing the -1. Ironically, some of the testing for ORC is happening under Hadoop v2 where the issue doesn't come up since Hadoop v2 bundles protobuf. > need to add protobuf classes to hive-exec.jar > - > > Key: HIVE-4156 > URL: https://issues.apache.org/jira/browse/HIVE-4156 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: HIVE-4156.D9375.1.patch > > > In some queries, the tasks fail when they can't find classes from the > protobuf library.
[jira] [Commented] (HIVE-4156) need to add protobuf classes to hive-exec.jar
[ https://issues.apache.org/jira/browse/HIVE-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601246#comment-13601246 ] Owen O'Malley commented on HIVE-4156: - ORC does require protobuf, which is exactly how I hit this. > need to add protobuf classes to hive-exec.jar > - > > Key: HIVE-4156 > URL: https://issues.apache.org/jira/browse/HIVE-4156 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: HIVE-4156.D9375.1.patch > > > In some queries, the tasks fail when they can't find classes from the > protobuf library.
[jira] [Updated] (HIVE-4156) need to add protobuf classes to hive-exec.jar
[ https://issues.apache.org/jira/browse/HIVE-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4156: Status: Patch Available (was: Open) > need to add protobuf classes to hive-exec.jar > - > > Key: HIVE-4156 > URL: https://issues.apache.org/jira/browse/HIVE-4156 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: HIVE-4156.D9375.1.patch > > > In some queries, the tasks fail when they can't find classes from the > protobuf library.
[jira] [Updated] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils
[ https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4138: Attachment: h-4138.patch This updates the patch since the decimal reader/writer went in. > ORC's union object inspector returns a type name that isn't parseable by > TypeInfoUtils > -- > > Key: HIVE-4138 > URL: https://issues.apache.org/jira/browse/HIVE-4138 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: h-4138.patch, HIVE-4138.D9219.1.patch > > > Currently the typename returned by ORC's union object inspector isn't > parseable by TypeInfoUtils. The format needs to be union.
[jira] [Updated] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils
[ https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4138: Status: Patch Available (was: Open) > ORC's union object inspector returns a type name that isn't parseable by > TypeInfoUtils > -- > > Key: HIVE-4138 > URL: https://issues.apache.org/jira/browse/HIVE-4138 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: h-4138.patch, HIVE-4138.D9219.1.patch > > > Currently the typename returned by ORC's union object inspector isn't > parseable by TypeInfoUtils. The format needs to be union.
[jira] [Updated] (HIVE-4156) need to add protobuf classes to hive-exec.jar
[ https://issues.apache.org/jira/browse/HIVE-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4156: Component/s: Serializers/Deserializers > need to add protobuf classes to hive-exec.jar > - > > Key: HIVE-4156 > URL: https://issues.apache.org/jira/browse/HIVE-4156 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > > In some queries, the tasks fail when they can't find classes from the > protobuf library.
[jira] [Created] (HIVE-4156) need to add protobuf classes to hive-exec.jar
Owen O'Malley created HIVE-4156: --- Summary: need to add protobuf classes to hive-exec.jar Key: HIVE-4156 URL: https://issues.apache.org/jira/browse/HIVE-4156 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley In some queries, the tasks fail when they can't find classes from the protobuf library.
[jira] [Updated] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils
[ https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4138: Status: Patch Available (was: Open) > ORC's union object inspector returns a type name that isn't parseable by > TypeInfoUtils > -- > > Key: HIVE-4138 > URL: https://issues.apache.org/jira/browse/HIVE-4138 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: HIVE-4138.D9219.1.patch > > > Currently the typename returned by ORC's union object inspector isn't > parseable by TypeInfoUtils. The format needs to be union.
[jira] [Updated] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils
[ https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4138: Component/s: Serializers/Deserializers > ORC's union object inspector returns a type name that isn't parseable by > TypeInfoUtils > -- > > Key: HIVE-4138 > URL: https://issues.apache.org/jira/browse/HIVE-4138 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > > Currently the typename returned by ORC's union object inspector isn't > parseable by TypeInfoUtils. The format needs to be union.
[jira] [Updated] (HIVE-4120) Implement decimal encoding for ORC
[ https://issues.apache.org/jira/browse/HIVE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4120: Status: Patch Available (was: Open) > Implement decimal encoding for ORC > -- > > Key: HIVE-4120 > URL: https://issues.apache.org/jira/browse/HIVE-4120 > Project: Hive > Issue Type: New Feature > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: HIVE-4120.D9207.1.patch > > > Currently, ORC does not have an encoder for decimal.
[jira] [Created] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils
Owen O'Malley created HIVE-4138: --- Summary: ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils Key: HIVE-4138 URL: https://issues.apache.org/jira/browse/HIVE-4138 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Currently the typename returned by ORC's union object inspector isn't parseable by TypeInfoUtils. The format needs to be union.
[jira] [Assigned] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils
[ https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned HIVE-4138: --- Assignee: Owen O'Malley > ORC's union object inspector returns a type name that isn't parseable by > TypeInfoUtils > -- > > Key: HIVE-4138 > URL: https://issues.apache.org/jira/browse/HIVE-4138 > Project: Hive > Issue Type: Bug >Reporter: Owen O'Malley >Assignee: Owen O'Malley > > Currently the typename returned by ORC's union object inspector isn't > parseable by TypeInfoUtils. The format needs to be union.
[jira] [Updated] (HIVE-4127) Testing with Hadoop 2.x causes test failure for ORC's TestFileDump
[ https://issues.apache.org/jira/browse/HIVE-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4127: Status: Patch Available (was: Open) > Testing with Hadoop 2.x causes test failure for ORC's TestFileDump > -- > > Key: HIVE-4127 > URL: https://issues.apache.org/jira/browse/HIVE-4127 > Project: Hive > Issue Type: New Feature > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: HIVE-4127.D9111.1.patch > > > Hadoop 2's junit is a newer version, which causes differences in behaviors of > the TestFileDump.
[jira] [Commented] (HIVE-4015) Add ORC file to the grammar as a file format
[ https://issues.apache.org/jira/browse/HIVE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593987#comment-13593987 ] Owen O'Malley commented on HIVE-4015: - +1 looks good to me. > Add ORC file to the grammar as a file format > > > Key: HIVE-4015 > URL: https://issues.apache.org/jira/browse/HIVE-4015 > Project: Hive > Issue Type: Improvement > Reporter: Owen O'Malley >Assignee: Gunther Hagleitner > Attachments: HIVE-4015.1.patch, HIVE-4015.2.patch, HIVE-4015.3.patch, > HIVE-4015.4.patch > > > It would be much more convenient for users if we enable them to use ORC as a > file format in the HQL grammar.
[jira] [Created] (HIVE-4127) Testing with Hadoop 2.x causes test failure for ORC's TestFileDump
Owen O'Malley created HIVE-4127: --- Summary: Testing with Hadoop 2.x causes test failure for ORC's TestFileDump Key: HIVE-4127 URL: https://issues.apache.org/jira/browse/HIVE-4127 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley Hadoop 2's junit is a newer version, which causes differences in the behavior of TestFileDump.
[jira] [Resolved] (HIVE-2899) Remove dependency on sun's jdk.
[ https://issues.apache.org/jira/browse/HIVE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved HIVE-2899. - Resolution: Invalid I'm closing this. > Remove dependency on sun's jdk. > --- > > Key: HIVE-2899 > URL: https://issues.apache.org/jira/browse/HIVE-2899 > Project: Hive > Issue Type: Improvement >Reporter: Owen O'Malley >Assignee: Owen O'Malley > > When the signal handlers were added, they introduced a dependency on > sun.misc.Signal and sun.misc.SignalHandler. We can look these classes up by > reflection and avoid the warning and also provide a soft-fail for non-sun > jvms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HIVE-4123) The RLE encoding for ORC can be improved
Owen O'Malley created HIVE-4123: --- Summary: The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs
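The three improvements listed in HIVE-4123 can be combined in one integer codec: runs are stored as a base value plus a constant delta, and values that don't form a run fall back to literals. The sketch below is a hypothetical in-memory form of such a scheme, not ORC's actual byte-level encoding (which additionally bit-packs the literals and serializes the entries):

```java
import java.util.ArrayList;
import java.util.List;

public class DeltaRle {
    // A run of >= 3 values with a constant delta becomes one
    // {length, delta, base} entry; everything else becomes a
    // {-1, value} literal.
    public static List<long[]> encode(long[] v) {
        List<long[]> out = new ArrayList<>();
        int i = 0;
        while (i < v.length) {
            int run = 1;
            long delta = 0;
            if (i + 1 < v.length) {
                delta = v[i + 1] - v[i];
                run = 2;
                while (i + run < v.length && v[i + run] - v[i + run - 1] == delta) {
                    run++;
                }
            }
            if (run >= 3) {
                out.add(new long[]{run, delta, v[i]});  // run entry
                i += run;
            } else {
                out.add(new long[]{-1, v[i]});          // literal entry
                i++;
            }
        }
        return out;
    }

    // Expand the entries back into a flat array of n values.
    public static long[] decode(List<long[]> enc, int n) {
        long[] out = new long[n];
        int pos = 0;
        for (long[] e : enc) {
            if (e[0] == -1) {
                out[pos++] = e[1];
            } else {
                long val = e[2];
                for (long k = 0; k < e[0]; k++) {
                    out[pos++] = val;
                    val += e[1];
                }
            }
        }
        return out;
    }
}
```

A delta of 0 covers plain repeated values, so this strictly generalizes classic run-length encoding; longer runs cost nothing extra because the length is a field rather than a small fixed-width header.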
[jira] [Updated] (HIVE-4121) ORC should have optional dictionaries for both strings and numeric types
[ https://issues.apache.org/jira/browse/HIVE-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4121: Description: Currently string columns always have dictionaries and numerics are always directly encoded. It would be better to make the encoding depend on a sample of the data. Perhaps the first 100k values should be evaluated for repeated values and the encoding picked for the stripe. > ORC should have optional dictionaries for both strings and numeric types > > > Key: HIVE-4121 > URL: https://issues.apache.org/jira/browse/HIVE-4121 > Project: Hive > Issue Type: New Feature > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > > Currently string columns always have dictionaries and numerics are always > directly encoded. It would be better to make the encoding depend on a sample > of the data. Perhaps the first 100k values should be evaluated for repeated > values and the encoding picked for the stripe. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
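The sampling idea described in HIVE-4121 — evaluate an initial slice of the column and pick the stripe's encoding from it — can be sketched as follows. The distinct-value ratio below is an illustrative tuning knob, not a value taken from ORC; the proposal suggests sampling roughly the first 100k values per stripe:

```java
import java.util.HashSet;
import java.util.Set;

public class EncodingChooser {
    // Hypothetical threshold: dictionary-encode when fewer than half of
    // the sampled values are distinct.
    static final double DICTIONARY_THRESHOLD = 0.5;

    public static boolean useDictionary(String[] sample) {
        Set<String> distinct = new HashSet<>();
        for (String s : sample) {
            distinct.add(s);
        }
        // Many repeats => small distinct/total ratio => dictionary pays off.
        return (double) distinct.size() / sample.length < DICTIONARY_THRESHOLD;
    }
}
```

The same test applies symmetrically: a string column with nearly all-distinct values would switch to direct encoding, and a numeric column with heavy repetition could gain a dictionary.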
[jira] [Created] (HIVE-4120) Implement decimal encoding for ORC
Owen O'Malley created HIVE-4120: --- Summary: Implement decimal encoding for ORC Key: HIVE-4120 URL: https://issues.apache.org/jira/browse/HIVE-4120 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, ORC does not have an encoder for decimal.
[jira] [Created] (HIVE-4121) ORC should have optional dictionaries for both strings and numeric types
Owen O'Malley created HIVE-4121: --- Summary: ORC should have optional dictionaries for both strings and numeric types Key: HIVE-4121 URL: https://issues.apache.org/jira/browse/HIVE-4121 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley
[jira] [Commented] (HIVE-4113) select count(1) reads all columns with RCFile
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592751#comment-13592751 ] Owen O'Malley commented on HIVE-4113: - There are a couple of contexts where Hive assumes that an empty string means all columns. Those as well as the code in ORC and RCFile will need to be fixed. > select count(1) reads all columns with RCFile > - > > Key: HIVE-4113 > URL: https://issues.apache.org/jira/browse/HIVE-4113 > Project: Hive > Issue Type: Bug >Reporter: Gopal V > > select count(1) loads up every column & every row when used with RCFile. > "select count(1) from store_sales_10_rc" gives > {code} > Job 0: Map: 5 Reduce: 1 Cumulative CPU: 31.73 sec HDFS Read: 234914410 > HDFS Write: 8 SUCCESS > {code} > Where as, "select count(ss_sold_date_sk) from store_sales_10_rc;" reads far > less > {code} > Job 0: Map: 5 Reduce: 1 Cumulative CPU: 29.75 sec HDFS Read: 28145994 > HDFS Write: 8 SUCCESS > {code} > Which is 11% of the data size read by the COUNT(1). > This was tracked down to the following code in RCFile.java > {code} > } else { > // TODO: if no column name is specified e.g, in select count(1) from > tt; > // skip all columns, this should be distinguished from the case: > // select * from tt; > for (int i = 0; i < skippedColIDs.length; i++) { > skippedColIDs[i] = false; > } > {code}
[jira] [Updated] (HIVE-4098) OrcInputFormat assumes Hive always calls createValue
[ https://issues.apache.org/jira/browse/HIVE-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4098: Status: Patch Available (was: Open) The patch removes the assumption of a dedicated row for each RecordReader. > OrcInputFormat assumes Hive always calls createValue > > > Key: HIVE-4098 > URL: https://issues.apache.org/jira/browse/HIVE-4098 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: HIVE-4098.D9021.1.patch > > > Hive's HiveContextAwareRecordReader doesn't create a new value for each > InputFormat and instead reuses the same row between input formats. That > causes the first record of second (and third, etc.) partition to be dropped > and replaced with the last row of the previous partition. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
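The contract the HIVE-4098 patch restores can be illustrated with a toy reader: next() must copy the row into whatever value object the caller supplies, rather than writing into a value the reader allocated for itself, because Hive may hand one value object to several readers in turn. The class below is illustrative, not Hive's actual code:

```java
import java.util.Iterator;
import java.util.List;

public class ReusableValueReader {
    // Toy stand-in for a RecordReader over rows of ints.
    private final Iterator<int[]> rows;

    public ReusableValueReader(List<int[]> data) {
        this.rows = data.iterator();
    }

    // Fills the caller-supplied value; never hands out a reader-owned row,
    // so the caller may reuse a single value across many readers/partitions.
    public boolean next(int[] value) {
        if (!rows.hasNext()) {
            return false;
        }
        int[] row = rows.next();
        System.arraycopy(row, 0, value, 0, row.length);
        return true;
    }
}
```

A reader that instead returned an internally cached row would exhibit exactly the bug described: the shared value still holds the previous partition's last row when the next partition starts.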
[jira] [Updated] (HIVE-4097) ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids
[ https://issues.apache.org/jira/browse/HIVE-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4097: Status: Patch Available (was: Open) This patch fixes the problem and adds a test case to ensure that the empty string is correctly handled. > ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids > - > > Key: HIVE-4097 > URL: https://issues.apache.org/jira/browse/HIVE-4097 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: HIVE-4097.D9015.1.patch > > > Hive assumes that an empty string in hive.io.file.readcolumn.ids means all > columns. The ORC reader currently assumes it means no columns. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
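The interpretation HIVE-4097 fixes can be sketched as a small helper: given the comma-separated hive.io.file.readcolumn.ids value, an empty string must select every column, not none. This helper is hypothetical, not the actual ORC reader code:

```java
import java.util.Arrays;

public class ColumnPruning {
    // Hypothetical helper mirroring the Hive convention: an empty or
    // missing hive.io.file.readcolumn.ids value means "read all columns".
    public static boolean[] includedColumns(String ids, int columnCount) {
        boolean[] included = new boolean[columnCount];
        if (ids == null || ids.isEmpty()) {
            Arrays.fill(included, true);  // empty string selects everything
            return included;
        }
        for (String id : ids.split(",")) {
            included[Integer.parseInt(id.trim())] = true;
        }
        return included;
    }
}
```

This is also where the HIVE-4113 ambiguity lives: a count(1) query, which needs no columns at all, currently produces the same empty string as "all columns", so the two cases cannot be told apart from the conf value alone.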
[jira] [Commented] (HIVE-4015) Add ORC file to the grammar as a file format
[ https://issues.apache.org/jira/browse/HIVE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590722#comment-13590722 ] Owen O'Malley commented on HIVE-4015: - Gunther, this looks good. I'd suggest removing the code that lets you override the serde, since with ORC you really don't want to do that. > Add ORC file to the grammar as a file format > > > Key: HIVE-4015 > URL: https://issues.apache.org/jira/browse/HIVE-4015 > Project: Hive > Issue Type: Improvement >Reporter: Owen O'Malley >Assignee: Gunther Hagleitner > Attachments: HIVE-4015.1.patch, HIVE-4015.2.patch, HIVE-4015.3.patch > > > It would be much more convenient for users if we enable them to use ORC as a > file format in the HQL grammar. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HIVE-4098) OrcInputFormat assumes Hive always calls createValue
Owen O'Malley created HIVE-4098: --- Summary: OrcInputFormat assumes Hive always calls createValue Key: HIVE-4098 URL: https://issues.apache.org/jira/browse/HIVE-4098 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley Hive's HiveContextAwareRecordReader doesn't create a new value for each InputFormat and instead reuses the same row between input formats. That causes the first record of second (and third, etc.) partition to be dropped and replaced with the last row of the previous partition. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HIVE-4097) ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids
Owen O'Malley created HIVE-4097: --- Summary: ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids Key: HIVE-4097 URL: https://issues.apache.org/jira/browse/HIVE-4097 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley Hive assumes that an empty string in hive.io.file.readcolumn.ids means all columns. The ORC reader currently assumes it means no columns. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Status: Patch Available (was: Open) Pamela, Yeah, that probably makes sense. I'll file the follow up jiras. > Create a new Optimized Row Columnar file format for Hive > > > Key: HIVE-3874 > URL: https://issues.apache.org/jira/browse/HIVE-3874 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, > HIVE-3874.D8529.2.patch, HIVE-3874.D8529.3.patch, HIVE-3874.D8529.4.patch, > HIVE-3874.D8871.1.patch, OrcFileIntro.pptx, orc.tgz > > > There are several limitations of the current RC File format that I'd like to > address by creating a new format: > * each column value is stored as a binary blob, which means: > ** the entire column value must be read, decompressed, and deserialized > ** the file format can't use smarter type-specific compression > ** push down filters can't be evaluated > * the start of each row group needs to be found by scanning > * user metadata can only be added to the file when the file is created > * the file doesn't store the number of rows per a file or row group > * there is no mechanism for seeking to a particular row number, which is > required for external indexes. > * there is no mechanism for storing light weight indexes within the file to > enable push-down filters to skip entire row groups. > * the type of the rows aren't stored in the file -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588983#comment-13588983 ] Owen O'Malley commented on HIVE-3874: - I'm actually tracking down a bug that Gunther found with a query. Let me finish tracking it down. > Create a new Optimized Row Columnar file format for Hive > > > Key: HIVE-3874 > URL: https://issues.apache.org/jira/browse/HIVE-3874 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, > HIVE-3874.D8529.2.patch, HIVE-3874.D8529.3.patch, HIVE-3874.D8529.4.patch, > HIVE-3874.D8871.1.patch, OrcFileIntro.pptx, orc.tgz > > > There are several limitations of the current RC File format that I'd like to > address by creating a new format: > * each column value is stored as a binary blob, which means: > ** the entire column value must be read, decompressed, and deserialized > ** the file format can't use smarter type-specific compression > ** push down filters can't be evaluated > * the start of each row group needs to be found by scanning > * user metadata can only be added to the file when the file is created > * the file doesn't store the number of rows per a file or row group > * there is no mechanism for seeking to a particular row number, which is > required for external indexes. > * there is no mechanism for storing light weight indexes within the file to > enable push-down filters to skip entire row groups. > * the type of the rows aren't stored in the file
[jira] [Resolved] (HIVE-4058) make ORC versioned
[ https://issues.apache.org/jira/browse/HIVE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved HIVE-4058. - Resolution: Won't Fix > make ORC versioned > -- > > Key: HIVE-4058 > URL: https://issues.apache.org/jira/browse/HIVE-4058 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Namit Jain > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HIVE-4061) skip columns which are not accessed in the query for ORC
[ https://issues.apache.org/jira/browse/HIVE-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved HIVE-4061. - Resolution: Cannot Reproduce This is already done. > skip columns which are not accessed in the query for ORC > > > Key: HIVE-4061 > URL: https://issues.apache.org/jira/browse/HIVE-4061 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Namit Jain > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4058) make ORC versioned
[ https://issues.apache.org/jira/browse/HIVE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586398#comment-13586398 ] Owen O'Malley commented on HIVE-4058: - I should also note that if it is required at some point, we can always create such a field in the footer and treat that missing field as a version 0. > make ORC versioned > -- > > Key: HIVE-4058 > URL: https://issues.apache.org/jira/browse/HIVE-4058 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Namit Jain > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HIVE-4059) Make Column statistics for ORC optional
[ https://issues.apache.org/jira/browse/HIVE-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned HIVE-4059: --- Assignee: Owen O'Malley > Make Column statistics for ORC optional > --- > > Key: HIVE-4059 > URL: https://issues.apache.org/jira/browse/HIVE-4059 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Namit Jain >Assignee: Owen O'Malley > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4058) make ORC versioned
[ https://issues.apache.org/jira/browse/HIVE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586021#comment-13586021 ] Owen O'Malley commented on HIVE-4058: - The metadata is versioned, it just doesn't have a global version. The intent is that new fields can be added to the protobuf and the reader will check if those new fields are defined. > make ORC versioned > -- > > Key: HIVE-4058 > URL: https://issues.apache.org/jira/browse/HIVE-4058 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Namit Jain > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
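The "missing field means version 0" rule described in these comments is ordinary protobuf forward compatibility: a field added to the footer later is simply absent in older files, and the reader substitutes a default. In this sketch a Map stands in for the generated protobuf message; the field name is illustrative:

```java
import java.util.Map;

public class FooterCompat {
    // A reader written against a newer schema checks whether the field is
    // present and treats a missing writerVersion as version 0.
    public static int writerVersion(Map<String, Integer> footerFields) {
        return footerFields.getOrDefault("writerVersion", 0);
    }
}
```

With real protobuf the same check is the generated hasXxx() accessor on the footer message rather than a map lookup.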
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Status: Patch Available (was: Open) > Create a new Optimized Row Columnar file format for Hive > > > Key: HIVE-3874 > URL: https://issues.apache.org/jira/browse/HIVE-3874 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, > HIVE-3874.D8529.2.patch, HIVE-3874.D8529.3.patch, HIVE-3874.D8529.4.patch, > HIVE-3874.D8871.1.patch, OrcFileIntro.pptx, orc.tgz > > > There are several limitations of the current RC File format that I'd like to > address by creating a new format: > * each column value is stored as a binary blob, which means: > ** the entire column value must be read, decompressed, and deserialized > ** the file format can't use smarter type-specific compression > ** push down filters can't be evaluated > * the start of each row group needs to be found by scanning > * user metadata can only be added to the file when the file is created > * the file doesn't store the number of rows per a file or row group > * there is no mechanism for seeking to a particular row number, which is > required for external indexes. > * there is no mechanism for storing light weight indexes within the file to > enable push-down filters to skip entire row groups. > * the type of the rows aren't stored in the file -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586018#comment-13586018 ] Owen O'Malley commented on HIVE-3874: - Ok, I added some additional comments in the Writer as Namit asked and all of the unit test cases pass. > Create a new Optimized Row Columnar file format for Hive > > > Key: HIVE-3874 > URL: https://issues.apache.org/jira/browse/HIVE-3874 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, > HIVE-3874.D8529.2.patch, HIVE-3874.D8529.3.patch, HIVE-3874.D8529.4.patch, > HIVE-3874.D8871.1.patch, OrcFileIntro.pptx, orc.tgz > > > There are several limitations of the current RC File format that I'd like to > address by creating a new format: > * each column value is stored as a binary blob, which means: > ** the entire column value must be read, decompressed, and deserialized > ** the file format can't use smarter type-specific compression > ** push down filters can't be evaluated > * the start of each row group needs to be found by scanning > * user metadata can only be added to the file when the file is created > * the file doesn't store the number of rows per a file or row group > * there is no mechanism for seeking to a particular row number, which is > required for external indexes. > * there is no mechanism for storing light weight indexes within the file to > enable push-down filters to skip entire row groups. > * the type of the rows aren't stored in the file
[jira] [Commented] (HIVE-4000) Hive client goes into infinite loop at 100% cpu
[ https://issues.apache.org/jira/browse/HIVE-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13579591#comment-13579591 ] Owen O'Malley commented on HIVE-4000: - The kind of query that is creating the problem looks like: {code} from Tbl insert ... insert ... {code} The customer sees the problem with 50 or more inserts. > Hive client goes into infinite loop at 100% cpu > --- > > Key: HIVE-4000 > URL: https://issues.apache.org/jira/browse/HIVE-4000 > Project: Hive > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Fix For: 0.10.1 > > Attachments: HIVE-4000.D8493.1.patch > > > The Hive client starts multiple threads to track the progress of the > MapReduce jobs. Unfortunately those threads access several static HashMaps > that are not protected by locks. When the HashMaps are modified, they > sometimes cause race conditions that lead to the client threads getting stuck > in infinite loops. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Status: Patch Available (was: Open) > Create a new Optimized Row Columnar file format for Hive > > > Key: HIVE-3874 > URL: https://issues.apache.org/jira/browse/HIVE-3874 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, > OrcFileIntro.pptx, orc.tgz > > > There are several limitations of the current RC File format that I'd like to > address by creating a new format: > * each column value is stored as a binary blob, which means: > ** the entire column value must be read, decompressed, and deserialized > ** the file format can't use smarter type-specific compression > ** push down filters can't be evaluated > * the start of each row group needs to be found by scanning > * user metadata can only be added to the file when the file is created > * the file doesn't store the number of rows per a file or row group > * there is no mechanism for seeking to a particular row number, which is > required for external indexes. > * there is no mechanism for storing light weight indexes within the file to > enable push-down filters to skip entire row groups. > * the type of the rows aren't stored in the file -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HIVE-4015) Add ORC file to the grammar as a file format
[ https://issues.apache.org/jira/browse/HIVE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned HIVE-4015: --- Assignee: Owen O'Malley > Add ORC file to the grammar as a file format > > > Key: HIVE-4015 > URL: https://issues.apache.org/jira/browse/HIVE-4015 > Project: Hive > Issue Type: Improvement > Reporter: Owen O'Malley >Assignee: Owen O'Malley > > It would be much more convenient for users if we enable them to use ORC as a > file format in the HQL grammar. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HIVE-4015) Add ORC file to the grammar as a file format
Owen O'Malley created HIVE-4015: --- Summary: Add ORC file to the grammar as a file format Key: HIVE-4015 URL: https://issues.apache.org/jira/browse/HIVE-4015 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley It would be much more convenient for users if we enable them to use ORC as a file format in the HQL grammar.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576691#comment-13576691 ] Owen O'Malley commented on HIVE-3874: - Kevin, I had some distractions at work, but I should get the patch uploaded today. > Create a new Optimized Row Columnar file format for Hive > > > Key: HIVE-3874 > URL: https://issues.apache.org/jira/browse/HIVE-3874 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hive.3874.2.patch, OrcFileIntro.pptx, orc.tgz > > > There are several limitations of the current RC File format that I'd like to > address by creating a new format: > * each column value is stored as a binary blob, which means: > ** the entire column value must be read, decompressed, and deserialized > ** the file format can't use smarter type-specific compression > ** push down filters can't be evaluated > * the start of each row group needs to be found by scanning > * user metadata can only be added to the file when the file is created > * the file doesn't store the number of rows per a file or row group > * there is no mechanism for seeking to a particular row number, which is > required for external indexes. > * there is no mechanism for storing light weight indexes within the file to > enable push-down filters to skip entire row groups. > * the type of the rows aren't stored in the file -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-4000) Hive client goes into infinite loop at 100% cpu
[ https://issues.apache.org/jira/browse/HIVE-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4000: Status: Patch Available (was: Open) Replace the sets/maps with concurrent versions that protect against concurrent access from multiple threads. > Hive client goes into infinite loop at 100% cpu > --- > > Key: HIVE-4000 > URL: https://issues.apache.org/jira/browse/HIVE-4000 > Project: Hive > Issue Type: Bug >Affects Versions: 0.9.0 > Reporter: Owen O'Malley >Assignee: Owen O'Malley > Fix For: 0.10.1 > > Attachments: HIVE-4000.D8493.1.patch > > > The Hive client starts multiple threads to track the progress of the > MapReduce jobs. Unfortunately those threads access several static HashMaps > that are not protected by locks. When the HashMaps are modified, they > sometimes cause race conditions that lead to the client threads getting stuck > in infinite loops. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
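The fix described in the HIVE-4000 status update — replace the unsynchronized static collections shared by the progress-tracking threads with concurrent versions — can be sketched as below. Class and field names are illustrative, not Hive's actual ones:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ProgressRegistry {
    // Plain static HashMaps here can leave a resizing bucket chain cyclic
    // under concurrent modification, which is how readers spin forever at
    // 100% cpu. ConcurrentHashMap makes both structures safe to share.
    private static final Map<String, Integer> runningJobProgress =
        new ConcurrentHashMap<>();
    private static final Set<String> runningJobs =
        ConcurrentHashMap.newKeySet();

    public static void update(String jobId, int percent) {
        runningJobs.add(jobId);
        runningJobProgress.put(jobId, percent);
    }

    public static int progress(String jobId) {
        return runningJobProgress.getOrDefault(jobId, 0);
    }
}
```

The alternative of wrapping every access in synchronized blocks would also work, but concurrent collections keep the call sites unchanged, which matters when the maps are touched from many places.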
[jira] [Created] (HIVE-4000) Hive client goes into infinite loop at 100% cpu
Owen O'Malley created HIVE-4000: --- Summary: Hive client goes into infinite loop at 100% cpu Key: HIVE-4000 URL: https://issues.apache.org/jira/browse/HIVE-4000 Project: Hive Issue Type: Bug Affects Versions: 0.9.0 Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 0.10.1 The Hive client starts multiple threads to track the progress of the MapReduce jobs. Unfortunately those threads access several static HashMaps that are not protected by locks. When the HashMaps are modified, they sometimes cause race conditions that lead to the client threads getting stuck in infinite loops.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572623#comment-13572623 ] Owen O'Malley commented on HIVE-3874: - I've pushed the current version up to [github|http://github.com/hortonworks/orc] with the seek to record implemented. Does it make more sense to put ORC into serde or ql? RCFile is in ql, so I'd assumed it would go there. Thoughts? > Create a new Optimized Row Columnar file format for Hive > > > Key: HIVE-3874 > URL: https://issues.apache.org/jira/browse/HIVE-3874 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hive.3874.2.patch, OrcFileIntro.pptx, orc.tgz > > > There are several limitations of the current RC File format that I'd like to > address by creating a new format: > * each column value is stored as a binary blob, which means: > ** the entire column value must be read, decompressed, and deserialized > ** the file format can't use smarter type-specific compression > ** push down filters can't be evaluated > * the start of each row group needs to be found by scanning > * user metadata can only be added to the file when the file is created > * the file doesn't store the number of rows per a file or row group > * there is no mechanism for seeking to a particular row number, which is > required for external indexes. > * there is no mechanism for storing light weight indexes within the file to > enable push-down filters to skip entire row groups. > * the type of the rows aren't stored in the file -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571510#comment-13571510 ] Owen O'Malley commented on HIVE-3874: - [~kevinwilfong] Thanks for the bug fixes, Kevin. I pushed the DynamicByteArray and double serialization fixes to [github|https://github.com/hortonworks/orc]. I have the null column problem fixed, but it is tied into my other changes on my row-seek dev branch. I hope to finish up the row-seek today and I'll merge it into master and make the patch putting it into Hive.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566707#comment-13566707 ] Owen O'Malley commented on HIVE-3874: - [~namit], I've got one more feature that I'm working on (seek to row) and then I'll make a patch. I'm aiming to upload the patch on Friday.
Re: [VOTE] Amend Hive Bylaws + Add HCatalog Submodule
+1 and +1 On Mon, Jan 28, 2013 at 1:56 PM, Ashish Thusoo wrote: > Measure 1: +1 > Measure 2: +1 > > Ashish > > > On Mon, Jan 28, 2013 at 1:11 PM, Ashutosh Chauhan >wrote: > > > Measure 1: +1 > > Measure 2: +1 > > > > Ashutosh > > > > > > On Mon, Jan 28, 2013 at 11:48 AM, Carl Steinbach wrote: > > > >> Measure 1: +1 (binding) > >> Measure 2: +1 (binding) > >> > >> On Mon, Jan 28, 2013 at 11:47 AM, Carl Steinbach > wrote: > >> > >> > I am calling a vote on the following two measures. > >> > > >> > Measure 1: Amend Hive Bylaws to Define Submodules and Submodule > >> Committers > >> > > >> > If this measure passes the Apache Hive Project Bylaws will be > >> > amended with the following changes: > >> > > >> > > >> > > >> > https://cwiki.apache.org/confluence/display/Hive/Proposed+Changes+to+Hive+Bylaws+for+Submodule+Committers > >> > > >> > The motivation for these changes is discussed in the following > >> > email thread which appeared on the hive-dev and hcatalog-dev > >> > mailing lists: > >> > > >> > http://markmail.org/thread/u5nap7ghvyo7euqa > >> > > >> > > >> > Measure 2: Create HCatalog Submodule and Adopt HCatalog Codebase > >> > > >> > This measure provides for 1) the establishment of an HCatalog > >> > submodule in the Apache Hive Project, 2) the adoption of the > >> > Apache HCatalog codebase into the Hive HCatalog submodule, and > >> > 3) adding all currently active HCatalog committers as submodule > >> > committers on the Hive HCatalog submodule. > >> > > >> > Passage of this measure depends on the passage of Measure 1. > >> > > >> > > >> > Voting: > >> > > >> > Both measures require +1 votes from 2/3 of active Hive PMC > >> > members in order to pass. All participants in the Hive project > >> > are encouraged to vote on these measures, but only votes from > >> > active Hive PMC members are binding. The voting period > >> > commences immediately and shall last a minimum of six days. > >> > > >> > Voting is carried out by replying to this email thread. 
> >> > You must indicate which measure you are voting on in order for your vote > >> > to be counted. > >> > > >> > More details about the voting process can be found in the Apache > >> > Hive Project Bylaws: > >> > > >> > https://cwiki.apache.org/confluence/display/Hive/Bylaws > >> > > >> > > > > >
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Attachment: orc.tgz I've fixed some bugs.
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Attachment: (was: orc.tgz)
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Attachment: (was: orc.tgz)
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Attachment: orc.tgz I've updated the patch with the index suppression option that Namit asked for.
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Attachment: orc.tgz Here's the current version of the code. The seek to row isn't implemented and it is still a standalone project, but it will let people start looking at it.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557353#comment-13557353 ] Owen O'Malley commented on HIVE-3874: - Yin, large stripes (and I'm defaulting to 250MB) enable efficient reads from HDFS. The row indexes mitigate the large stripe size by providing offsets within each stripe.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557341#comment-13557341 ] Owen O'Malley commented on HIVE-3874: - Joydeep, I've used a two level strategy: * large stripes (default 250MB) to enable large efficient reads * relatively frequent row index entries (default 10k rows) to enable skipping within a stripe The row index entries have the locations within each column to enable seeking to the right compression block and byte within the decompressed block. I obviously did consider HFile, although from a practical point of view it is fairly embedded within HBase. Additionally, since it treats each of the columns as bytes it can't do any type-specific encodings/compression and can't interpret the column values, which is critical for performance. Once you have the ability to skip large sets of rows based on the filter predicates, you can sort the table on the secondary keys and achieve a large speed up. For example, if your primary partition is transaction date, you might want to sort the table on state, zip, and last name. Then if you are looking for just the records in CA it won't need to read the records for the other states.
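With the two-level strategy described above, locating a target row reduces to simple integer arithmetic. A sketch, assuming for illustration a fixed number of rows per stripe (real stripes are bounded by size, not row count; the class and constant names are hypothetical) and the 10k-row index interval mentioned in the thread:

```java
// Illustrative arithmetic only: with a known row count per stripe and a
// row-index entry every 10k rows, finding a row is two integer divisions.
public class RowLocator {
    static final long ROWS_PER_STRIPE = 1_000_000L;    // assumed for illustration
    static final long ROWS_PER_INDEX_ENTRY = 10_000L;  // default cited in the thread

    // Which stripe contains the target row?
    public static long stripeOf(long row) {
        return row / ROWS_PER_STRIPE;
    }

    // Which row-index entry within that stripe should the reader seek to?
    public static long indexEntryOf(long row) {
        return (row % ROWS_PER_STRIPE) / ROWS_PER_INDEX_ENTRY;
    }
}
```

The index entry then supplies the per-column positions (compression block offset plus byte within the decompressed block) needed to start reading.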
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551286#comment-13551286 ] Owen O'Malley commented on HIVE-3874: - Sambavi, I should have a patch ready next week. Yes, the row groups (stripes) are 250MB by default. I currently set the HDFS block size for the files to 2 times the stripe size, but I don't try to align them other than that.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551248#comment-13551248 ] Owen O'Malley commented on HIVE-3874: - Doug, of course Trevni could be modified arbitrarily to match the needs of Hive. But Hive will benefit more if there is a deep integration between the file format and the query engine. Both HBase and Accumulo have file formats that were originally based on Hadoop's TFile. But the need for integration with the query engine was such that their projects were better served by having the file format in their project rather than an upstream project. Of course the Avro project is free to copy any of the ORC code into Trevni, but Hive has the need to innovate in this area without asking Avro to make changes and waiting for them to be released.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551231#comment-13551231 ] Owen O'Malley commented on HIVE-3874: - Namit, I'm using the table properties to manage the other features like compression, so I would probably make a table property like 'orc.create.index' or something. Would that make sense? I should note that the indexes are very light. In a sample file: * uncompressed text: 370MB * compressed ORC: 86MB * row index in ORC: 140k
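A sketch of how a writer might consult such a table property. Note that 'orc.create.index' is only the name proposed in the comment above, and the default of true is an assumption; the class and method names are hypothetical:

```java
import java.util.Properties;

// Hypothetical sketch: honor an 'orc.create.index' table property
// (name proposed in the thread; default assumed to be true, since the
// index is cheap -- ~140k in an 86MB file per the numbers above).
public class OrcIndexOption {
    public static boolean createIndex(Properties tableProps) {
        return Boolean.parseBoolean(
            tableProps.getProperty("orc.create.index", "true"));
    }
}
```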
[jira] [Updated] (HIVE-3889) Add floating point compression to ORC file
[ https://issues.apache.org/jira/browse/HIVE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3889: Attachment: fpc-impl.tar This is the file that Karol emailed to me for me to submit to Apache. > Add floating point compression to ORC file > -- > > Key: HIVE-3889 > URL: https://issues.apache.org/jira/browse/HIVE-3889 > Project: Hive > Issue Type: New Feature > Components: Serializers/Deserializers > Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: fpc-impl.tar > > > Karol Wegrzycki, a CS student at University of Warsaw, has implemented an FPC > compressor for doubles. It would be great to hook this up to the ORC file > format so that we can get better compression for doubles.
[jira] [Created] (HIVE-3889) Add floating point compression to ORC file
Owen O'Malley created HIVE-3889: --- Summary: Add floating point compression to ORC file Key: HIVE-3889 URL: https://issues.apache.org/jira/browse/HIVE-3889 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley Karol Wegrzycki, a CS student at University of Warsaw, has implemented an FPC compressor for doubles. It would be great to hook this up to the ORC file format so that we can get better compression for doubles.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549839#comment-13549839 ] Owen O'Malley commented on HIVE-3874: - Namit, for pure hive users there aren't any advantages of trevni over ORC.
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549784#comment-13549784 ] Owen O'Malley commented on HIVE-3874: - Namit, I obviously did consider Trevni, but it didn't support some of the features that I wanted: * using the hive type model * more advanced encodings like dictionaries * the ability to support push down predicates for skipping row groups * running compression in block mode rather than streaming so that the reader can skip entire compression blocks
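The last point above, block-mode rather than streaming compression, can be sketched as follows: compressing each chunk with an independent compressor makes every compressed block self-contained, so a reader that knows the block lengths can skip unwanted blocks without decompressing them. This is an illustration using java.util.zip, not the actual ORC codec layer:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Sketch (not the real ORC stream layout): one independent Deflater per
// chunk, so each output block decompresses on its own and the recorded
// block lengths let a reader seek past blocks it doesn't need.
public class BlockCompressor {
    public static List<byte[]> compressChunks(List<byte[]> chunks) {
        List<byte[]> out = new ArrayList<>();
        for (byte[] chunk : chunks) {
            Deflater d = new Deflater();
            d.setInput(chunk);
            d.finish();
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!d.finished()) {
                bos.write(buf, 0, d.deflate(buf));
            }
            d.end();
            // Each entry's length would be stored so readers can skip it.
            out.add(bos.toByteArray());
        }
        return out;
    }
}
```

A streaming compressor, by contrast, carries dictionary state across chunk boundaries, so a reader must decompress everything up to the point of interest.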
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549774#comment-13549774 ] Owen O'Malley commented on HIVE-3874: - He Yongqiang, the APIs of the two formats are significantly different. It would be possible to extend the RCFile reader to recognize an ORC file and delegate to the ORC file reader. The other direction (having the ORC file reader parse an RCFile) isn't possible, because ORC provides operations that would be very expensive or impossible to implement in RCFile. One concern with making the RCFile reader delegate to the ORC file reader is that RCFile returns binary values that are interpreted by the serde, while in ORC deserialization happens in the reader. Therefore, either the adaptor would need to re-serialize the data, or the serde would need to change as well.
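The delegation idea above (an RCFile reader that recognizes an ORC file and hands off to the ORC reader) hinges on sniffing the file's leading magic bytes. A minimal sketch of that dispatch follows; the class and method names are hypothetical, and a real implementation would have to follow each format's actual specification rather than these illustrative magic strings:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: a combined reader could sniff the header bytes
 *  and delegate to the matching reader. ORC files begin with the bytes
 *  'O','R','C'; RCFile v1 begins with 'R','C','F'. */
public class FormatSniffer {
    public enum Format { RCFILE, ORC, UNKNOWN }

    public static Format detect(byte[] header) {
        if (startsWith(header, "ORC")) return Format.ORC;
        if (startsWith(header, "RCF")) return Format.RCFILE;
        return Format.UNKNOWN;  // e.g. older SequenceFile-style headers
    }

    private static boolean startsWith(byte[] buf, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (buf.length < m.length) return false;
        for (int i = 0; i < m.length; i++) {
            if (buf[i] != m[i]) return false;
        }
        return true;
    }
}
```

A delegating reader would call `detect` on the first few bytes and construct the appropriate underlying reader; the serde mismatch Owen mentions (binary values vs. deserialized objects) remains the hard part.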
[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3874: Attachment: OrcFileIntro.pptx
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548750#comment-13548750 ] Owen O'Malley commented on HIVE-3874: - Namit, Yes, it has dictionary encoding for strings. The dictionary enables both better compression and much more efficient push-down filters. The dictionaries are local to each row group, so that row groups can be processed independently of each other. Currently, strings are always dictionary encoded, but it would make sense to allow the writer to pick whether a column should be encoded directly or with a dictionary.
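The per-row-group string dictionary described above can be sketched in a few lines: each distinct value gets an integer id, and the column stores ids instead of repeated strings. This is an illustrative toy, not ORC's actual encoder (which adds sorting, run-length encoding of the ids, and size heuristics):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sketch of dictionary encoding for one string column within one
 *  row group. Names are hypothetical; ORC's real writer is more involved. */
public class StringDictionaryEncoder {
    private final Map<String, Integer> dictionary = new HashMap<>();
    private final List<String> entries = new ArrayList<>();  // dictionary entries in id order
    private final List<Integer> rowIds = new ArrayList<>();  // one id per row

    public void add(String value) {
        Integer id = dictionary.get(value);
        if (id == null) {
            id = entries.size();           // assign the next id to a new value
            dictionary.put(value, id);
            entries.add(value);
        }
        rowIds.add(id);
    }

    public int dictionarySize() { return entries.size(); }
    public List<Integer> encodedRows() { return rowIds; }
    public String decode(int id) { return entries.get(id); }
}
```

A push-down filter like `col = 'x'` only needs one dictionary lookup per row group: if `'x'` is absent from the dictionary, the whole row group can be skipped, which is the efficiency win the comment refers to.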
[jira] [Created] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
Owen O'Malley created HIVE-3874: --- Summary: Create a new Optimized Row Columnar file format for Hive Key: HIVE-3874 URL: https://issues.apache.org/jira/browse/HIVE-3874 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley
There are several limitations of the current RCFile format that I'd like to address by creating a new format:
* each column value is stored as a binary blob, which means:
** the entire column value must be read, decompressed, and deserialized
** the file format can't use smarter type-specific compression
** push-down filters can't be evaluated
* the start of each row group needs to be found by scanning
* user metadata can only be added to the file when the file is created
* the file doesn't store the number of rows per file or row group
* there is no mechanism for seeking to a particular row number, which is required for external indexes
* there is no mechanism for storing lightweight indexes within the file to enable push-down filters to skip entire row groups
* the types of the rows aren't stored in the file
[jira] [Updated] (HIVE-3234) getting the reporter in the recordwriter
[ https://issues.apache.org/jira/browse/HIVE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3234: Status: Patch Available (was: Open) This patch passes in the real mapreduce reporter as the progressable for getHiveRecordWriter. OutputFormats should still protect themselves from a null Progressable, but the FileSinkOperator passes a Reporter from the mapreduce job.
> getting the reporter in the recordwriter
> Key: HIVE-3234 URL: https://issues.apache.org/jira/browse/HIVE-3234 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 0.9.1 Environment: any Reporter: Jimmy Hu Assignee: Owen O'Malley Labels: newbie Fix For: 0.9.1 Attachments: HIVE-3234.D6699.1.patch Original Estimate: 48h Remaining Estimate: 48h
> We would like to generate some custom statistics and report back to map/reduce when we implement the FileSinkOperator.RecordWriter interface. However, the current interface design doesn't allow us to get the map reduce reporter object. Please extend the current FileSinkOperator.RecordWriter interface so that its close() method passes in a map reduce reporter object.
> For the same reason, please also extend the RecordReader interface to include a reporter object so that users can pass in custom map reduce counters.
Re: hive 0.10 release
+1
On Thu, Nov 8, 2012 at 3:18 PM, Carl Steinbach wrote:
> +1
> On Wed, Nov 7, 2012 at 11:23 PM, Alexander Lorenz wrote:
> > +1, good karma
> > On Nov 8, 2012, at 4:58 AM, Namit Jain wrote:
> > > +1 to the idea
> > > On 11/8/12 6:33 AM, "Edward Capriolo" wrote:
> > >> That sounds good. I think this issue needs to be solved, as well as anything else that produces a bogus query result.
> > >> https://issues.apache.org/jira/browse/HIVE-3083
> > >> Edward
> > >> On Wed, Nov 7, 2012 at 7:50 PM, Ashutosh Chauhan <hashut...@apache.org> wrote:
> > >>> Hi,
> > >>> It's been a while since our last release, more than six months ago. All this while, a lot of action has happened, with various cool features landing in trunk. Additionally, I am looking forward to HiveServer2 landing in trunk. So, I propose that we cut the branch for 0.10 soon afterwards and then release it. Thoughts?
> > >>> Thanks,
> > >>> Ashutosh
> > --
> > Alexander Alten-Lorenz
> > http://mapredit.blogspot.com
> > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
[jira] [Created] (HIVE-3660) Improve OutputFormat for Hive
Owen O'Malley created HIVE-3660: --- Summary: Improve OutputFormat for Hive Key: HIVE-3660 URL: https://issues.apache.org/jira/browse/HIVE-3660 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley
Hive's output formats are currently given a list of binary blobs to store, which severely limits the options for file formats. I'd like to create a new OutputFormat interface that provides:
* table properties
* object inspector for the row
* type info for the row
The RecordWriter would be passed the internal row object.
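The proposal above can be sketched as an interface shape. Everything below is hypothetical (none of these names come from actual Hive code); it only illustrates the contract being asked for: the writer factory sees table properties and row type information, and the RecordWriter receives the internal row object rather than a serialized blob. Checked exceptions are omitted to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical shape of the richer OutputFormat proposed in HIVE-3660. */
public interface RichOutputFormat {
    /** Stand-in for Hive's ObjectInspector / TypeInfo pair. */
    interface RowTypeInfo { }

    interface RecordWriter {
        void write(Object row);   // internal row object, not a binary blob
        void close();
    }

    RecordWriter getRecordWriter(Properties tableProperties, RowTypeInfo rowType);

    /** Trivial in-memory implementation, for illustration only. */
    class InMemory implements RichOutputFormat {
        public final List<Object> rows = new ArrayList<>();

        @Override
        public RecordWriter getRecordWriter(Properties tableProperties, RowTypeInfo rowType) {
            return new RecordWriter() {
                @Override public void write(Object row) { rows.add(row); }
                @Override public void close() { }
            };
        }
    }
}
```

The design point is that a format like ORC can use the row type to choose per-column encodings, which is impossible when it only ever sees pre-serialized bytes.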
[jira] [Updated] (HIVE-3599) missing return of compression codec to pool
[ https://issues.apache.org/jira/browse/HIVE-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3599: Status: Patch Available (was: Open)
[jira] [Updated] (HIVE-3599) missing return of compression codec to pool
[ https://issues.apache.org/jira/browse/HIVE-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3599: Attachment: hive-3599.patch Here's the obvious fix. There is no functional difference.
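The class of bug fixed here is a borrowed-but-never-returned pool resource. A generic sketch of the discipline follows; this is a stand-in pool, not Hadoop's actual CodecPool class, and the names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Toy object pool illustrating the borrow/return contract: every codec
 *  taken from the pool must be given back on all paths (try/finally). */
public class CodecPoolSketch<T> {
    private final Deque<T> pool = new ArrayDeque<>();
    private int outstanding = 0;   // borrowed but not yet returned

    public synchronized T borrow(Supplier<T> factory) {
        outstanding++;
        return pool.isEmpty() ? factory.get() : pool.pop();
    }

    public synchronized void giveBack(T codec) {
        outstanding--;
        pool.push(codec);          // available for reuse by the next borrower
    }

    public synchronized int outstanding() { return outstanding; }
}
```

The safe usage pattern is `T c = pool.borrow(...); try { /* compress */ } finally { pool.giveBack(c); }`; a path that skips the `giveBack` is exactly the leak this JIRA describes.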
[jira] [Created] (HIVE-3599) missing return of compression codec to pool
Owen O'Malley created HIVE-3599: --- Summary: missing return of compression codec to pool Key: HIVE-3599 URL: https://issues.apache.org/jira/browse/HIVE-3599 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Owen O'Malley The RCFile writer is currently missing a call to return one of the compression codecs to the pool.
[jira] [Assigned] (HIVE-3599) missing return of compression codec to pool
[ https://issues.apache.org/jira/browse/HIVE-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned HIVE-3599: --- Assignee: Owen O'Malley
Re: non map-reduce for simple queries
On Mon, Jul 30, 2012 at 11:38 PM, Namit Jain wrote:
> That would be difficult. The % done can be estimated from the data already read.
I'm confused. Wouldn't the maximum size of the data remaining over the maximum size of the original query's input give a reasonable approximation of the amount of work done?
> It might be simpler to have a check like: if the query isn't done in the first 5 seconds of running locally, you switch to mapreduce.
There are three problems I see:
* If the query is 95% done at 5 seconds, it is a shame to kill it and start over again at 0% on mapreduce with a much longer latency. (Instead of spending an additional 0.25 seconds, you spend an additional 60+.)
* You can't print anything until you know whether you are going to kill it or not. (The mapreduce results might come back in a different order.) With user-facing programs, it is much better to start printing early rather than late, since it gives faster feedback to the user.
* It isn't predictable how the query will run. That makes it very hard to build applications on top of Hive.
Do those make sense?
Re: non map-reduce for simple queries
On Mon, Jul 30, 2012 at 9:12 PM, Namit Jain wrote:
> The total number of bytes of the input will be used to determine whether or not to launch a map-reduce job for this query. That was in my original mail.
> However, given any complex where condition and the lack of column statistics in hive, we cannot determine the number of bytes that would be needed to satisfy the where condition.
All of these heuristics are guidelines, clearly. My inclination would be to use the maximum data volume as the primary metric until we have a better understanding of cases where that doesn't work well. If we are going to try the local solution and fall back to mapreduce, it seems better to put the limit well short of being done so that you don't waste as much work. Perhaps, if the query isn't 10% done in the first 5 seconds of running locally, you switch to mapreduce. Would that work? -- Owen
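The fallback rule proposed above ("if the query isn't 10% done in the first 5 seconds of running locally, switch to mapreduce") can be written as a small predicate. The thresholds and names below are illustrative only, taken from the discussion rather than from any actual Hive code:

```java
/** Sketch of the local-fetch fallback heuristic from the thread above:
 *  abandon the local run and launch MapReduce if, after a grace period,
 *  less than a minimum fraction of the input has been processed. */
public class LocalFetchFallback {
    static final long GRACE_MILLIS = 5_000;    // "first 5 seconds"
    static final double MIN_FRACTION = 0.10;   // "10% done"

    public static boolean shouldSwitchToMapReduce(long bytesRead, long totalBytes,
                                                  long elapsedMillis) {
        if (elapsedMillis < GRACE_MILLIS) return false;  // still inside the grace period
        double fractionDone = totalBytes == 0 ? 1.0 : (double) bytesRead / totalBytes;
        return fractionDone < MIN_FRACTION;
    }
}
```

Checking progress well short of completion is the point Owen makes: a query that is already 95% done locally keeps running, while one crawling at 5% after the grace period is restarted on MapReduce before too much work is wasted.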
Re: non map-reduce for simple queries
On Sat, Jul 28, 2012 at 6:17 PM, Navis류승우 wrote: > I was thinking of timeout for fetching, 2000msec for example. How about > that? > Instead of time, which requires launching the query and letting it timeout, how about determining the number of bytes that would need to be fetched to the local box? Limiting it to 100 or 200 mb seems reasonable. -- Owen
[jira] [Commented] (HIVE-3153) Release codecs and output streams between flushes of RCFile
[ https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423464#comment-13423464 ] Owen O'Malley commented on HIVE-3153: - I also wrote a test program that just writes to a large number of RCFile.Writers. With the patch, I was able to use a lot more Writers before the process ran out of memory.
> Release codecs and output streams between flushes of RCFile
> Key: HIVE-3153 URL: https://issues.apache.org/jira/browse/HIVE-3153 Project: Hive Issue Type: Improvement Components: Compression Reporter: Owen O'Malley Assignee: Owen O'Malley Attachments: hive-3153.patch
> Currently, the RCFile writer holds a compression codec per file and a compression output stream per column. Especially for queries that use dynamic partitions, this quickly consumes a lot of memory.
> I'd like flushRecords to get a codec from the pool and create the compression output stream in flushRecords.
[jira] [Commented] (HIVE-3153) Release codecs and output streams between flushes of RCFile
[ https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423451#comment-13423451 ] Owen O'Malley commented on HIVE-3153: - The use case that this helps is the one with a relatively large number (~2000) of dynamic partitions per reducer. In that case there is an open RCFile.Writer per dynamic partition, but they aren't being flushed in parallel. By moving the extra buffers and compression codecs so that they are acquired only when they are needed for a flush, instead of for the whole lifespan of the Writer, I'm able to keep a lot more Writers open at once.
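The lifetime change described above can be sketched as follows: the writer's only steady-state memory is its buffered rows, and compression machinery exists just for the duration of flushRecords(). This uses java.util.zip as a self-contained stand-in for Hadoop's codec classes; the structure is illustrative, not Hive's actual RCFile code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

/** Sketch: acquire the compressor only inside flushRecords(), so 2000
 *  open writers cost 2000 row buffers, not 2000 resident compressors. */
public class LazyCompressingWriter {
    private final List<byte[]> pendingRows = new ArrayList<>();  // only state held between flushes

    public void append(byte[] row) { pendingRows.add(row); }

    public byte[] flushRecords() {
        Deflater deflater = new Deflater();  // "borrowed" only for the flush
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater);
            for (byte[] row : pendingRows) dos.write(row);
            dos.close();                     // finishes the deflate stream
            pendingRows.clear();
            return out.toByteArray();
        } catch (IOException impossible) {
            throw new AssertionError(impossible);  // ByteArrayOutputStream never throws
        } finally {
            deflater.end();                  // release native resources promptly
        }
    }
}
```

Since the ~2000 writers never flush in parallel, at most one compressor is alive at a time, which is exactly why the restructuring lets many more Writers stay open in the same heap.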
[jira] [Commented] (HIVE-3153) Release codecs and output streams between flushes of RCFile
[ https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421023#comment-13421023 ] Owen O'Malley commented on HIVE-3153: - I just posted this as https://reviews.facebook.net/D4299 .
[jira] [Assigned] (HIVE-3234) getting the reporter in the recordwriter
[ https://issues.apache.org/jira/browse/HIVE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned HIVE-3234: --- Assignee: Owen O'Malley
[jira] [Commented] (HIVE-3098) Memory leak from large number of FileSystem instances in FileSystem.CACHE. (Must cache UGIs.)
[ https://issues.apache.org/jira/browse/HIVE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403579#comment-13403579 ] Owen O'Malley commented on HIVE-3098: - Alejandro, Daryn is absolutely right that we can't make the Subjects immutable. We need to be able to update a Subject with updated Kerberos tickets and tokens, and changing that would break a lot of other code. It would probably make sense to add a UGI.doAsAndCleanup that does a doAs and then removes all filesystems based on the UGI, since clearly most of the Hadoop ecosystem servers have related problems.
> Memory leak from large number of FileSystem instances in FileSystem.CACHE. (Must cache UGIs.)
> Key: HIVE-3098 URL: https://issues.apache.org/jira/browse/HIVE-3098 Project: Hive Issue Type: Bug Components: Shims Affects Versions: 0.9.0 Environment: Running with Hadoop 20.205.0.3+ / 1.0.x with security turned on. Reporter: Mithun Radhakrishnan Assignee: Mithun Radhakrishnan Attachments: HIVE-3098.patch
> The problem manifested from stress-testing HCatalog 0.4.1 (as part of testing the Oracle backend).
> The HCatalog server ran out of memory (-Xmx2048m) when pounded by 60 threads, in under 24 hours. The heap dump indicates that hadoop::FileSystem.CACHE had 100 instances of FileSystem, whose combined retained memory consumed the entire heap.
> It boiled down to hadoop::UserGroupInformation::equals() being implemented such that the "Subject" member is compared for identity ("=="), and not equivalence (".equals()"). This causes equivalent UGI instances to compare as unequal, and causes a new FileSystem instance to be created and cached.
> UGI.equals() is so implemented, incidentally, as a fix for yet another problem (HADOOP-6670), so it is unlikely that that implementation can be modified.
> The solution is to check for UGI equivalence in HCatalog (i.e. in the Hive metastore), using a cache for UGI instances in the shims.
> I have a patch to fix this; I'll upload it shortly. I just ran an overnight test to confirm that the memory leak has been arrested.
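The workaround described in the issue, caching UGI instances so equivalent callers reuse one object, can be sketched generically: since UGI.equals() is identity-based on its Subject, the cache is keyed on an explicit equivalence class (here simply the user name). The names below are illustrative, not HCatalog's actual shim code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch of a UGI-style cache: equivalent callers get the SAME instance,
 *  so downstream identity-keyed caches (like FileSystem.CACHE) don't grow
 *  one entry per request. U stands in for UserGroupInformation. */
public class UgiCache<U> {
    private final Map<String, U> cache = new ConcurrentHashMap<>();

    /** Return the cached instance for this user, creating it at most once. */
    public U getOrCreate(String userName, Function<String, U> factory) {
        return cache.computeIfAbsent(userName, factory);
    }

    public int size() { return cache.size(); }
}
```

Because the same instance is handed back for the same user, the identity comparison inside FileSystem.CACHE now succeeds, which is what arrests the leak. A real cache would also need an eviction or cleanup policy so long-lived servers don't hold stale credentials.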
[jira] [Updated] (HIVE-3153) Release codecs and output streams between flushes of RCFile
[ https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3153: Attachment: hive-3153.patch This patch:
* Fixes some javadoc
* Suppresses some unused warnings
* Deprecates some of the unused public functions that don't seem to be important parts of the API
* Reduces the memory footprint of the Writer to just the array of ColumnBuffers
With this patch, I'm able to use many more parallel writers in the same memory footprint.
[jira] [Updated] (HIVE-3153) Release codecs and output streams between flushes of RCFile
[ https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-3153: Status: Patch Available (was: Open)