[jira] [Created] (HIVE-4250) Closing lots of RecordWriters is slow

2013-03-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4250:
---

 Summary: Closing lots of RecordWriters is slow
 Key: HIVE-4250
 URL: https://issues.apache.org/jira/browse/HIVE-4250
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


In FileSinkOperator, all of the RecordWriters are closed sequentially. For 
queries with a lot of dynamic partitions, this can add substantially to the task 
time. For one query in particular, the reduce tasks processed all of the records 
in a few minutes and then spent 15 minutes closing all of the RC files.
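
To illustrate the direction a fix could take (a minimal sketch, not the actual 
Hive patch; java.io.Closeable stands in for Hive's RecordWriter), the writers 
could be closed concurrently from a small thread pool:

{code}
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelClose {
  // Close all writers concurrently instead of one at a time.
  public static void closeAll(List<? extends Closeable> writers, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (Closeable w : writers) {
        pending.add(pool.submit(() -> {
          try {
            w.close();
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // propagate the first close failure
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}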



[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC

2013-03-28 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616691#comment-13616691
 ] 

Owen O'Malley commented on HIVE-4248:
-

This may result in ORC files with smaller stripes, but that seems far better 
than letting users hit out-of-memory exceptions.

> Implement a memory manager for ORC
> --
>
> Key: HIVE-4248
> URL: https://issues.apache.org/jira/browse/HIVE-4248
> Project: Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> With the large default stripe size (256MB) and dynamic partitions, it is 
> quite easy for users to run out of memory when writing ORC files. We probably 
> need a solution that keeps track of the total number of concurrent ORC 
> writers and divides the available heap space between them. 



[jira] [Created] (HIVE-4248) Implement a memory manager for ORC

2013-03-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4248:
---

 Summary: Implement a memory manager for ORC
 Key: HIVE-4248
 URL: https://issues.apache.org/jira/browse/HIVE-4248
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


With the large default stripe size (256MB) and dynamic partitions, it is quite 
easy for users to run out of memory when writing ORC files. We probably need a 
solution that keeps track of the total number of concurrent ORC writers and 
divides the available heap space between them. 
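
As a rough illustration of the proposed policy (a sketch only, not ORC's 
eventual memory manager): give each registered writer an equal share of a fixed 
budget, and flush a stripe early when a writer's buffered bytes exceed its share.

{code}
import java.util.HashSet;
import java.util.Set;

public class StripeMemoryManager {
  private final long totalBudget; // e.g. some fraction of the JVM heap
  private final Set<Object> writers = new HashSet<>();

  public StripeMemoryManager(long totalBudget) {
    this.totalBudget = totalBudget;
  }

  public synchronized void register(Object writer) {
    writers.add(writer);
  }

  public synchronized void unregister(Object writer) {
    writers.remove(writer);
  }

  // True when this writer's buffered bytes exceed its per-writer share,
  // i.e. it should flush a (smaller than usual) stripe now.
  public synchronized boolean shouldFlush(long bufferedBytes) {
    long share = totalBudget / Math.max(1, writers.size());
    return bufferedBytes > share;
  }
}
{code}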



[jira] [Commented] (HIVE-4227) Add column level encryption to ORC files

2013-03-28 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616678#comment-13616678
 ] 

Owen O'Malley commented on HIVE-4227:
-

Andrew,
  Yes, if the code is available and provides the right API.

> Add column level encryption to ORC files
> 
>
> Key: HIVE-4227
> URL: https://issues.apache.org/jira/browse/HIVE-4227
> Project: Hive
>  Issue Type: New Feature
>    Reporter: Owen O'Malley
>  Labels: gsoc, gsoc2013
>
> It would be useful to support column level encryption in ORC files. Since 
> each column and its associated index is stored separately, encrypting a 
> column separately isn't difficult. In terms of key distribution, it would 
> make sense to use an external server like the one in HADOOP-9331.



[jira] [Commented] (HIVE-4244) Make string dictionaries adaptive in ORC

2013-03-28 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616657#comment-13616657
 ] 

Owen O'Malley commented on HIVE-4244:
-

We should play with different values, but I was guessing the right cutover 
point for the heuristic was at a loading of 2 to 3 (50% to 33% distinct values).

We aren't really going to know whether the heuristic is right or wrong unless 
we compare both encodings, which is much too expensive. By taking a good guess 
after looking at the start of the stripe, we can get good performance most of 
the time.
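
A sketch of what that heuristic might look like (the 100,000-value sample comes 
from the issue description below; the loading threshold of 2 is just the guess 
from this comment, not a settled value):

{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionaryHeuristic {
  static final int SAMPLE_SIZE = 100_000;
  static final double MIN_LOADING = 2.0; // values per distinct value (<= 50% distinct)

  // Decide from the first values of a column whether dictionary encoding
  // is likely to pay off for the rest of the stripe.
  public static boolean useDictionary(List<String> columnValues) {
    int n = Math.min(columnValues.size(), SAMPLE_SIZE);
    Set<String> distinct = new HashSet<>(columnValues.subList(0, n));
    double loading = (double) n / Math.max(1, distinct.size());
    return loading >= MIN_LOADING;
  }
}
{code}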

> Make string dictionaries adaptive in ORC
> 
>
> Key: HIVE-4244
> URL: https://issues.apache.org/jira/browse/HIVE-4244
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Kevin Wilfong
>
> The ORC writer should adaptively switch between dictionary and direct 
> encoding. I'd propose looking at the first 100,000 values in each column and 
> decide whether there is sufficient loading in the dictionary to use 
> dictionary encoding.



[jira] [Commented] (HIVE-4245) Implement numeric dictionaries in ORC

2013-03-28 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616613#comment-13616613
 ] 

Owen O'Malley commented on HIVE-4245:
-

If you look at the original ORC github, you can see a float and double red-black 
tree that I pulled out while getting it ready for the initial push into Apache. 

https://github.com/hortonworks/orc/tree/9cdb2e88d377c801655fbb9015938ea3a93e12ca/src/main/java/org/apache/hadoop/hive/ql/io/orc

> Implement numeric dictionaries in ORC
> -
>
> Key: HIVE-4245
> URL: https://issues.apache.org/jira/browse/HIVE-4245
> Project: Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Pamela Vagata
>
> For many applications, especially in de-normalized data, there is a lot of 
> redundancy in the numeric columns. Therefore, it would make sense to 
> adaptively use dictionary encodings for numeric columns in addition to string 
> columns.



[jira] [Created] (HIVE-4246) Implement predicate pushdown for ORC

2013-03-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4246:
---

 Summary: Implement predicate pushdown for ORC
 Key: HIVE-4246
 URL: https://issues.apache.org/jira/browse/HIVE-4246
 Project: Hive
  Issue Type: New Feature
Reporter: Owen O'Malley
Assignee: Owen O'Malley


By using the push-down predicates from the table scan operator, ORC can skip 
10,000 rows at a time when they won't satisfy the predicate. This will help a 
lot, especially if the file is sorted by the column that is used in the 
predicate.
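
A toy illustration of the idea (the stats type is hypothetical, standing in for 
ORC's per-row-group index entries): min/max statistics let the reader prove that 
a 10,000-row group cannot match and seek past it.

{code}
public class RowGroupSkipper {
  // Hypothetical per-row-group statistics, standing in for ORC's row index.
  static class GroupStats {
    final long min, max;
    GroupStats(long min, long max) { this.min = min; this.max = max; }
  }

  // For a predicate like "col = value": if the value lies outside the
  // group's [min, max], no row in the group can match, so the reader can
  // skip the whole 10,000-row group without decompressing it.
  static boolean canSkipEquals(GroupStats stats, long value) {
    return value < stats.min || value > stats.max;
  }

  public static void main(String[] args) {
    System.out.println(canSkipEquals(new GroupStats(100, 200), 500)); // true
  }
}
{code}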



[jira] [Resolved] (HIVE-4121) ORC should have optional dictionaries for both strings and numeric types

2013-03-28 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved HIVE-4121.
-

Resolution: Duplicate

I forgot I had filed this and filed the split-apart ones as HIVE-4244 and 
HIVE-4245.

> ORC should have optional dictionaries for both strings and numeric types
> 
>
> Key: HIVE-4121
> URL: https://issues.apache.org/jira/browse/HIVE-4121
> Project: Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> Currently string columns always have dictionaries and numerics are always 
> directly encoded. It would be better to make the encoding depend on a sample 
> of the data. Perhaps the first 100k values should be evaluated for repeated 
> values and the encoding picked for the stripe.



[jira] [Created] (HIVE-4245) Implement numeric dictionaries in ORC

2013-03-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4245:
---

 Summary: Implement numeric dictionaries in ORC
 Key: HIVE-4245
 URL: https://issues.apache.org/jira/browse/HIVE-4245
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


For many applications, especially in de-normalized data, there is a lot of 
redundancy in the numeric columns. Therefore, it would make sense to adaptively 
use dictionary encodings for numeric columns in addition to string columns.
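
A toy sketch of a numeric dictionary (illustrative only, not ORC's actual 
encoder): distinct long values are stored once, and each row stores just a 
small index into them, which is what makes the encoding win on redundant 
columns.

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LongDictionary {
  private final Map<Long, Integer> ids = new HashMap<>();
  private final List<Long> values = new ArrayList<>();

  // Returns the dictionary index for the value, adding it if new.
  public int add(long value) {
    Integer id = ids.get(value);
    if (id == null) {
      id = values.size();
      values.add(value);
      ids.put(value, id);
    }
    return id;
  }

  public long lookup(int id) { return values.get(id); }
  public int distinctCount() { return values.size(); }
}
{code}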



[jira] [Created] (HIVE-4244) Make string dictionaries adaptive in ORC

2013-03-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4244:
---

 Summary: Make string dictionaries adaptive in ORC
 Key: HIVE-4244
 URL: https://issues.apache.org/jira/browse/HIVE-4244
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


The ORC writer should adaptively switch between dictionary and direct encoding. 
I'd propose looking at the first 100,000 values in each column and decide 
whether there is sufficient loading in the dictionary to use dictionary 
encoding.



[jira] [Resolved] (HIVE-2162) Upgrade dependencies to Hadoop 0.20.2 and 0.20.203.0

2013-03-28 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved HIVE-2162.
-

Resolution: Duplicate

This has been fixed already.

> Upgrade dependencies to Hadoop 0.20.2 and 0.20.203.0
> 
>
> Key: HIVE-2162
> URL: https://issues.apache.org/jira/browse/HIVE-2162
> Project: Hive
>  Issue Type: Improvement
>    Reporter: Owen O'Malley
>
> Hadoop has released 0.20.203.0 and we should upgrade Hive's dependency to it.



[jira] [Created] (HIVE-4243) Fix column names in FileSinkOperator

2013-03-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4243:
---

 Summary: Fix column names in FileSinkOperator
 Key: HIVE-4243
 URL: https://issues.apache.org/jira/browse/HIVE-4243
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Owen O'Malley


All of the ObjectInspectors given to SerDes by FileSinkOperator have virtual 
column names. Since the files are part of tables, Hive knows the column names. 
For self-describing file formats like ORC, having the real column names will 
improve understandability.
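
As a hedged sketch of the direction (the actual wiring inside FileSinkOperator 
may differ; the column names below are made up), Hive's ObjectInspectorFactory 
can already build a struct inspector from real column names rather than 
generated virtual ones:

{code}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class NamedStructInspector {
  // Build a struct inspector carrying the table's real column names
  // (hypothetical example columns) instead of generated virtual ones.
  public static StructObjectInspector forColumns() {
    List<String> names = Arrays.asList("user_id", "visit_time");
    List<ObjectInspector> types = Arrays.asList(
        (ObjectInspector) PrimitiveObjectInspectorFactory.javaLongObjectInspector,
        PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    return ObjectInspectorFactory.getStandardStructObjectInspector(names, types);
  }
}
{code}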



[jira] [Commented] (HIVE-4227) Add column level encryption to ORC files

2013-03-28 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616329#comment-13616329
 ] 

Owen O'Malley commented on HIVE-4227:
-

Supun,
  I've tagged this for Google Summer of Code. Take a look at:
http://www.google-melange.com/gsoc/homepage/google/gsoc2013

> Add column level encryption to ORC files
> 
>
> Key: HIVE-4227
> URL: https://issues.apache.org/jira/browse/HIVE-4227
> Project: Hive
>  Issue Type: New Feature
>Reporter: Owen O'Malley
>  Labels: gsoc, gsoc2013
>
> It would be useful to support column level encryption in ORC files. Since 
> each column and its associated index is stored separately, encrypting a 
> column separately isn't difficult. In terms of key distribution, it would 
> make sense to use an external server like the one in HADOOP-9331.



[jira] [Created] (HIVE-4242) Predicate push down should also be provided to InputFormats

2013-03-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4242:
---

 Summary: Predicate push down should also be provided to 
InputFormats
 Key: HIVE-4242
 URL: https://issues.apache.org/jira/browse/HIVE-4242
 Project: Hive
  Issue Type: Bug
  Components: StorageHandler
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Currently, the push down predicate is only provided to native tables if the 
hive.optimize.index.filter configuration variable is set. There is no reason to 
prevent InputFormats from getting the required information to do predicate push 
down.

Obviously, this will be very useful for ORC.



[jira] [Created] (HIVE-4229) Create a hive-ql jar that doesn't include non-hive jars

2013-03-25 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4229:
---

 Summary: Create a hive-ql jar that doesn't include non-hive jars
 Key: HIVE-4229
 URL: https://issues.apache.org/jira/browse/HIVE-4229
 Project: Hive
  Issue Type: New Feature
Reporter: Owen O'Malley


We currently only ship the ql module as part of the hive-exec jar that includes 
other projects (thrift, avro, protobuf, commons lang, json, java-ewah, and 
javolution). This forces downstream users to get the upstream projects too.



Re: Question - why are there instances of org.apache.commons.lang.StringUtils and WordUtils bundled in hive?

2013-03-25 Thread Owen O'Malley
You're right. I was thinking there was a hive-ql jar, but there isn't.
(Note that they aren't duplicated in the source tree, just packaged up in
the jar.) I've created https://issues.apache.org/jira/browse/HIVE-4229 to
provide a jar of the ql classes without the upstream classes included.

Note that in the long term, I think we need to simplify the jars, but
that is a bigger issue.

-- Owen


On Mon, Mar 25, 2013 at 3:23 PM, Dave Winterbourne <
dave.winterbou...@gmail.com> wrote:

> We have a custom User Defined Function that extends UDF - I'll admit some
> ignorance, as I inherited this code, but UDF is a class that comes from
> hive-exec, so it doesn't seem true that hive-exec is not intended for
> external usage. That having been said, my original question is why there
> are classes from commons-lang that are simply duplicated in the code base.
> This is bad form at best, and at worst it causes class collisions and thus
> duplicate class warnings.
>
> On Mon, Mar 25, 2013 at 2:48 PM, Owen O'Malley  wrote:
>
> > Hive-exec isn't meant for external usage. It is the bundled jar of Hive's
> > runtime dependencies that are required for Hive's MapReduce tasks. It
> > consists of :
> >
> > hive-common
> > hive-ql
> > hive-serde
> > hive-shims
> > thrift
> > commons-lang
> > json
> > avro
> > avro-mapred
> > java-ewah
> > javolution
> > protobuf-java
> >
> > -- Owen
> >
> >
> > On Mon, Mar 25, 2013 at 11:42 AM, Dave Winterbourne <
> > dave.winterbou...@gmail.com> wrote:
> >
> > > I have been working on eliminating duplicate class warnings in my maven
> > > build, and in the end discovered that there are two classes from apache
> > > commons-lang that are bundled with hive-exec:
> > >
> > > jar tf hive-0.10.0-bin//lib/hive-exec-0.10.0.jar | grep
> > > org/apache/commons/lang/
> > > org/apache/commons/lang/
> > > org/apache/commons/lang/StringUtils.class
> > > org/apache/commons/lang/WordUtils.class
> > >
> > > Why are these classes bundled with hive as opposed to just using
> > > commons-lang? If there truly is a need for custom functionality, why
> not
> > > put it in a different class to avoid this collision?
> > >
> >
>


Re: Question - why are there instances of org.apache.commons.lang.StringUtils and WordUtils bundled in hive?

2013-03-25 Thread Owen O'Malley
Hive-exec isn't meant for external usage. It is the bundled jar of Hive's
runtime dependencies that are required for Hive's MapReduce tasks. It
consists of :

hive-common
hive-ql
hive-serde
hive-shims
thrift
commons-lang
json
avro
avro-mapred
java-ewah
javolution
protobuf-java

-- Owen


On Mon, Mar 25, 2013 at 11:42 AM, Dave Winterbourne <
dave.winterbou...@gmail.com> wrote:

> I have been working on eliminating duplicate class warnings in my maven
> build, and in the end discovered that there are two classes from apache
> commons-lang that are bundled with hive-exec:
>
> jar tf hive-0.10.0-bin//lib/hive-exec-0.10.0.jar | grep
> org/apache/commons/lang/
> org/apache/commons/lang/
> org/apache/commons/lang/StringUtils.class
> org/apache/commons/lang/WordUtils.class
>
> Why are these classes bundled with hive as opposed to just using
> commons-lang? If there truly is a need for custom functionality, why not
> put it in a different class to avoid this collision?
>


[jira] [Created] (HIVE-4227) Add column level encryption to ORC files

2013-03-25 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4227:
---

 Summary: Add column level encryption to ORC files
 Key: HIVE-4227
 URL: https://issues.apache.org/jira/browse/HIVE-4227
 Project: Hive
  Issue Type: New Feature
Reporter: Owen O'Malley


It would be useful to support column level encryption in ORC files. Since each 
column and its associated index is stored separately, encrypting a column 
separately isn't difficult. In terms of key distribution, it would make sense 
to use an external server like the one in HADOOP-9331.
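
Purely illustrative of the per-column idea (not a proposed implementation; key 
management would come from an external server like the one in HADOOP-9331): 
each column's serialized byte stream gets its own key.

{code}
import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class ColumnCrypto {
  // Encrypt a single column's serialized bytes under that column's key.
  public static byte[] encryptColumn(byte[] columnBytes, SecretKey columnKey)
      throws Exception {
    Cipher cipher = Cipher.getInstance("AES");
    cipher.init(Cipher.ENCRYPT_MODE, columnKey);
    return cipher.doFinal(columnBytes);
  }

  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey(); // one key per column
    byte[] out = encryptColumn("column bytes".getBytes(StandardCharsets.UTF_8), key);
    System.out.println(out.length + " encrypted bytes");
  }
}
{code}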



[jira] [Commented] (HIVE-4114) hive-metastore.jar depends on jdo2-api:jar:2.3-ec, which is missing in maven central

2013-03-13 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601460#comment-13601460
 ] 

Owen O'Malley commented on HIVE-4114:
-

You'll also need to install the jdo2 jar in your maven repository:

{code}
# first download jdo2-api-2.3-ec.jar to your working directory
mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api \
  -Dversion=2.3-ec -Dpackaging=jar -Dfile=jdo2-api-2.3-ec.jar
{code}

The new jdo jar is available from 
http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/jdo2-api-2.3-ec.jar


> hive-metastore.jar depends on jdo2-api:jar:2.3-ec, which is missing in maven 
> central
> 
>
> Key: HIVE-4114
> URL: https://issues.apache.org/jira/browse/HIVE-4114
> Project: Hive
>  Issue Type: Bug
>  Components: Build Infrastructure
>Reporter: Gopal V
>Priority: Trivial
>
> Adding hive-exec-0.10.0 to an independent pom.xml results in the following 
> error
> {code}
> Failed to retrieve javax.jdo:jdo2-api-2.3-ec
> Caused by: Could not find artifact javax.jdo:jdo2-api:jar:2.3-ec in central 
> (http://repo1.maven.org/maven2)
> ...
> Path to dependency: 
>   1) org.notmysock.hive:plan-viewer:jar:1.0-SNAPSHOT
>   2) org.apache.hive:hive-exec:jar:0.10.0
>   3) org.apache.hive:hive-metastore:jar:0.10.0
>   4) javax.jdo:jdo2-api:jar:2.3-ec
> {code}
> From the best I could tell, in the hive build ant+ivy pulls this file from 
> the datanucleus repo
> http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/
> For completeness' sake, the dependency needs to be published to Maven Central.



[jira] [Commented] (HIVE-4156) need to add protobuf classes to hive-exec.jar

2013-03-13 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601257#comment-13601257
 ] 

Owen O'Malley commented on HIVE-4156:
-

No worries, but thanks for removing the -1. Ironically, some of the testing for 
ORC is happening under Hadoop v2 where the issue doesn't come up since Hadoop 
v2 bundles protobuf.

> need to add protobuf classes to hive-exec.jar
> -
>
> Key: HIVE-4156
> URL: https://issues.apache.org/jira/browse/HIVE-4156
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: HIVE-4156.D9375.1.patch
>
>
> In some queries, the tasks fail when they can't find classes from the 
> protobuf library.



[jira] [Commented] (HIVE-4156) need to add protobuf classes to hive-exec.jar

2013-03-13 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601246#comment-13601246
 ] 

Owen O'Malley commented on HIVE-4156:
-

ORC does require protobuf, which is exactly how I hit this.

> need to add protobuf classes to hive-exec.jar
> -
>
> Key: HIVE-4156
> URL: https://issues.apache.org/jira/browse/HIVE-4156
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: HIVE-4156.D9375.1.patch
>
>
> In some queries, the tasks fail when they can't find classes from the 
> protobuf library.



[jira] [Updated] (HIVE-4156) need to add protobuf classes to hive-exec.jar

2013-03-13 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4156:


Status: Patch Available  (was: Open)

> need to add protobuf classes to hive-exec.jar
> -
>
> Key: HIVE-4156
> URL: https://issues.apache.org/jira/browse/HIVE-4156
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: HIVE-4156.D9375.1.patch
>
>
> In some queries, the tasks fail when they can't find classes from the 
> protobuf library.



[jira] [Updated] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils

2013-03-12 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4138:


Attachment: h-4138.patch

This updates the patch since the decimal reader/writer went in.

> ORC's union object inspector returns a type name that isn't parseable by 
> TypeInfoUtils
> --
>
> Key: HIVE-4138
> URL: https://issues.apache.org/jira/browse/HIVE-4138
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: h-4138.patch, HIVE-4138.D9219.1.patch
>
>
> Currently the typename returned by ORC's union object inspector isn't 
> parseable by TypeInfoUtils. The format needs to be union.



[jira] [Updated] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils

2013-03-12 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4138:


Status: Patch Available  (was: Open)

> ORC's union object inspector returns a type name that isn't parseable by 
> TypeInfoUtils
> --
>
> Key: HIVE-4138
> URL: https://issues.apache.org/jira/browse/HIVE-4138
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: h-4138.patch, HIVE-4138.D9219.1.patch
>
>
> Currently the typename returned by ORC's union object inspector isn't 
> parseable by TypeInfoUtils. The format needs to be union.



[jira] [Updated] (HIVE-4156) need to add protobuf classes to hive-exec.jar

2013-03-12 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4156:


Component/s: Serializers/Deserializers

> need to add protobuf classes to hive-exec.jar
> -
>
> Key: HIVE-4156
> URL: https://issues.apache.org/jira/browse/HIVE-4156
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> In some queries, the tasks fail when they can't find classes from the 
> protobuf library.



[jira] [Created] (HIVE-4156) need to add protobuf classes to hive-exec.jar

2013-03-12 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4156:
---

 Summary: need to add protobuf classes to hive-exec.jar
 Key: HIVE-4156
 URL: https://issues.apache.org/jira/browse/HIVE-4156
 Project: Hive
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Owen O'Malley


In some queries, the tasks fail when they can't find classes from the protobuf 
library.



[jira] [Updated] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils

2013-03-07 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4138:


Status: Patch Available  (was: Open)

> ORC's union object inspector returns a type name that isn't parseable by 
> TypeInfoUtils
> --
>
> Key: HIVE-4138
> URL: https://issues.apache.org/jira/browse/HIVE-4138
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: HIVE-4138.D9219.1.patch
>
>
> Currently the typename returned by ORC's union object inspector isn't 
> parseable by TypeInfoUtils. The format needs to be union.



[jira] [Updated] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils

2013-03-07 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4138:


Component/s: Serializers/Deserializers

> ORC's union object inspector returns a type name that isn't parseable by 
> TypeInfoUtils
> --
>
> Key: HIVE-4138
> URL: https://issues.apache.org/jira/browse/HIVE-4138
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> Currently the typename returned by ORC's union object inspector isn't 
> parseable by TypeInfoUtils. The format needs to be union.



[jira] [Updated] (HIVE-4120) Implement decimal encoding for ORC

2013-03-07 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4120:


Status: Patch Available  (was: Open)

> Implement decimal encoding for ORC
> --
>
> Key: HIVE-4120
> URL: https://issues.apache.org/jira/browse/HIVE-4120
> Project: Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: HIVE-4120.D9207.1.patch
>
>
> Currently, ORC does not have an encoder for decimal.



[jira] [Created] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils

2013-03-07 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4138:
---

 Summary: ORC's union object inspector returns a type name that 
isn't parseable by TypeInfoUtils
 Key: HIVE-4138
 URL: https://issues.apache.org/jira/browse/HIVE-4138
 Project: Hive
  Issue Type: Bug
    Reporter: Owen O'Malley


Currently the typename returned by ORC's union object inspector isn't parseable 
by TypeInfoUtils. The format needs to be union.



[jira] [Assigned] (HIVE-4138) ORC's union object inspector returns a type name that isn't parseable by TypeInfoUtils

2013-03-07 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reassigned HIVE-4138:
---

Assignee: Owen O'Malley

> ORC's union object inspector returns a type name that isn't parseable by 
> TypeInfoUtils
> --
>
> Key: HIVE-4138
> URL: https://issues.apache.org/jira/browse/HIVE-4138
> Project: Hive
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> Currently the typename returned by ORC's union object inspector isn't 
> parseable by TypeInfoUtils. The format needs to be union.



[jira] [Updated] (HIVE-4127) Testing with Hadoop 2.x causes test failure for ORC's TestFileDump

2013-03-05 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4127:


Status: Patch Available  (was: Open)

> Testing with Hadoop 2.x causes test failure for ORC's TestFileDump
> --
>
> Key: HIVE-4127
> URL: https://issues.apache.org/jira/browse/HIVE-4127
> Project: Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: HIVE-4127.D9111.1.patch
>
>
> Hadoop 2's junit is a newer version, which causes differences in the behavior 
> of TestFileDump. 



[jira] [Commented] (HIVE-4015) Add ORC file to the grammar as a file format

2013-03-05 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593987#comment-13593987
 ] 

Owen O'Malley commented on HIVE-4015:
-

+1 looks good to me.

> Add ORC file to the grammar as a file format
> 
>
> Key: HIVE-4015
> URL: https://issues.apache.org/jira/browse/HIVE-4015
> Project: Hive
>  Issue Type: Improvement
>    Reporter: Owen O'Malley
>Assignee: Gunther Hagleitner
> Attachments: HIVE-4015.1.patch, HIVE-4015.2.patch, HIVE-4015.3.patch, 
> HIVE-4015.4.patch
>
>
> It would be much more convenient for users if we enable them to use ORC as a 
> file format in the HQL grammar. 



[jira] [Created] (HIVE-4127) Testing with Hadoop 2.x causes test failure for ORC's TestFileDump

2013-03-05 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4127:
---

 Summary: Testing with Hadoop 2.x causes test failure for ORC's 
TestFileDump
 Key: HIVE-4127
 URL: https://issues.apache.org/jira/browse/HIVE-4127
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
    Assignee: Owen O'Malley


Hadoop 2's junit is a newer version, which causes differences in the behavior 
of TestFileDump. 



[jira] [Resolved] (HIVE-2899) Remove dependency on sun's jdk.

2013-03-05 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved HIVE-2899.
-

Resolution: Invalid

I'm closing this.

> Remove dependency on sun's jdk.
> ---
>
> Key: HIVE-2899
> URL: https://issues.apache.org/jira/browse/HIVE-2899
> Project: Hive
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> When the signal handlers were added, they introduced a dependency on 
> sun.misc.Signal and sun.misc.SignalHandler. We can look these classes up by 
> reflection and avoid the warning and also provide a soft-fail for non-sun 
> jvms.



[jira] [Created] (HIVE-4123) The RLE encoding for ORC can be improved

2013-03-05 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4123:
---

 Summary: The RLE encoding for ORC can be improved
 Key: HIVE-4123
 URL: https://issues.apache.org/jira/browse/HIVE-4123
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


The run-length encoding of integers can be improved:
* tighter bit packing
* allow delta encoding
* allow longer runs
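
A toy sketch of the delta-encoding item (illustrative only, not ORC's actual 
format): a run of integers with one constant difference is stored as 
(base, delta, length) instead of literal values.

{code}
public class DeltaRle {
  // Find the longest run starting at 'start' with one constant delta;
  // returns {base, delta, runLength}.
  static long[] encodeRun(long[] values, int start) {
    long base = values[start];
    if (start + 1 >= values.length) {
      return new long[] {base, 0, 1};
    }
    long delta = values[start + 1] - base;
    int len = 2;
    while (start + len < values.length
        && values[start + len] - values[start + len - 1] == delta) {
      len++;
    }
    return new long[] {base, delta, len};
  }

  public static void main(String[] args) {
    long[] run = encodeRun(new long[] {5, 7, 9, 11, 20}, 0);
    // prints base=5 delta=2 length=4
    System.out.printf("base=%d delta=%d length=%d%n", run[0], run[1], run[2]);
  }
}
{code}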



[jira] [Updated] (HIVE-4121) ORC should have optional dictionaries for both strings and numeric types

2013-03-05 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4121:


Description: Currently string columns always have dictionaries and numerics 
are always directly encoded. It would be better to make the encoding depend on 
a sample of the data. Perhaps the first 100k values should be evaluated for 
repeated values and the encoding picked for the stripe.

> ORC should have optional dictionaries for both strings and numeric types
> 
>
> Key: HIVE-4121
> URL: https://issues.apache.org/jira/browse/HIVE-4121
> Project: Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> Currently string columns always have dictionaries and numerics are always 
> directly encoded. It would be better to make the encoding depend on a sample 
> of the data. Perhaps the first 100k values should be evaluated for repeated 
> values and the encoding picked for the stripe.



[jira] [Created] (HIVE-4120) Implement decimal encoding for ORC

2013-03-05 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4120:
---

 Summary: Implement decimal encoding for ORC
 Key: HIVE-4120
 URL: https://issues.apache.org/jira/browse/HIVE-4120
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Currently, ORC does not have an encoder for decimal.



[jira] [Created] (HIVE-4121) ORC should have optional dictionaries for both strings and numeric types

2013-03-05 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4121:
---

 Summary: ORC should have optional dictionaries for both strings 
and numeric types
 Key: HIVE-4121
 URL: https://issues.apache.org/jira/browse/HIVE-4121
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley






[jira] [Commented] (HIVE-4113) select count(1) reads all columns with RCFile

2013-03-04 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592751#comment-13592751
 ] 

Owen O'Malley commented on HIVE-4113:
-

There are a couple of contexts where Hive assumes that an empty string means all 
columns. Those, as well as the code in ORC and RCFile, will need to be fixed.

> select count(1) reads all columns with RCFile
> -
>
> Key: HIVE-4113
> URL: https://issues.apache.org/jira/browse/HIVE-4113
> Project: Hive
>  Issue Type: Bug
>Reporter: Gopal V
>
> select count(1) loads up every column & every row when used with RCFile.
> "select count(1) from store_sales_10_rc" gives
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 31.73 sec   HDFS Read: 234914410 
> HDFS Write: 8 SUCCESS
> {code}
> Where as, "select count(ss_sold_date_sk) from store_sales_10_rc;" reads far 
> less
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 29.75 sec   HDFS Read: 28145994 
> HDFS Write: 8 SUCCESS
> {code}
> Which is 11% of the data size read by the COUNT(1).
> This was tracked down to the following code in RCFile.java
> {code}
>   } else {
> // TODO: if no column name is specified e.g, in select count(1) from 
> tt;
> // skip all columns, this should be distinguished from the case:
> // select * from tt;
> for (int i = 0; i < skippedColIDs.length; i++) {
>   skippedColIDs[i] = false;
> }
> {code}



[jira] [Updated] (HIVE-4098) OrcInputFormat assumes Hive always calls createValue

2013-03-01 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4098:


Status: Patch Available  (was: Open)

The patch removes the assumption of a dedicated row for each RecordReader.

> OrcInputFormat assumes Hive always calls createValue
> 
>
> Key: HIVE-4098
> URL: https://issues.apache.org/jira/browse/HIVE-4098
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: HIVE-4098.D9021.1.patch
>
>
> Hive's HiveContextAwareRecordReader doesn't create a new value for each 
> InputFormat and instead reuses the same row between input formats. That 
> causes the first record of the second (and third, etc.) partition to be 
> dropped and replaced with the last row of the previous partition.



[jira] [Updated] (HIVE-4097) ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids

2013-03-01 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4097:


Status: Patch Available  (was: Open)

This patch fixes the problem and adds a test case to ensure that the empty 
string is correctly handled.

> ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids
> -
>
> Key: HIVE-4097
> URL: https://issues.apache.org/jira/browse/HIVE-4097
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: HIVE-4097.D9015.1.patch
>
>
> Hive assumes that an empty string in hive.io.file.readcolumn.ids means all 
> columns. The ORC reader currently assumes it means no columns.



[jira] [Commented] (HIVE-4015) Add ORC file to the grammar as a file format

2013-03-01 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590722#comment-13590722
 ] 

Owen O'Malley commented on HIVE-4015:
-

Gunther, this looks good. I'd suggest removing the code that lets you override 
the serde, since with ORC you really don't want to do that.

> Add ORC file to the grammar as a file format
> 
>
> Key: HIVE-4015
> URL: https://issues.apache.org/jira/browse/HIVE-4015
> Project: Hive
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Gunther Hagleitner
> Attachments: HIVE-4015.1.patch, HIVE-4015.2.patch, HIVE-4015.3.patch
>
>
> It would be much more convenient for users if we enable them to use ORC as a 
> file format in the HQL grammar. 



[jira] [Created] (HIVE-4098) OrcInputFormat assumes Hive always calls createValue

2013-03-01 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4098:
---

 Summary: OrcInputFormat assumes Hive always calls createValue
 Key: HIVE-4098
 URL: https://issues.apache.org/jira/browse/HIVE-4098
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Hive's HiveContextAwareRecordReader doesn't create a new value for each 
InputFormat and instead reuses the same row between input formats. That causes 
the first record of the second (and third, etc.) partition to be dropped and 
replaced with the last row of the previous partition.
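
A simplified model of the failure mode (types are stand-ins, not Hive's or 
ORC's actual classes): because Hive reuses one value object across readers, a 
reader must fill the object it is handed rather than one from its own 
createValue().

{code}
import java.io.IOException;

public class ValueReuseSketch {
  static class Row { String data; }

  interface Reader {
    Row createValue();
    boolean next(Row value) throws IOException; // must fill 'value' in place
  }

  // Hive-style loop: ONE value object is created and reused for every
  // reader. A reader that caches rows internally and ignores 'shared'
  // makes the first next() after a switch return stale data.
  static void drainAll(Reader[] readers) throws IOException {
    Row shared = readers[0].createValue();
    for (Reader r : readers) {
      while (r.next(shared)) {
        System.out.println(shared.data);
      }
    }
  }
}
{code}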



[jira] [Created] (HIVE-4097) ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids

2013-03-01 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4097:
---

 Summary: ORC file doesn't properly interpret empty 
hive.io.file.readcolumn.ids
 Key: HIVE-4097
 URL: https://issues.apache.org/jira/browse/HIVE-4097
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
    Assignee: Owen O'Malley


Hive assumes that an empty string in hive.io.file.readcolumn.ids means all 
columns. The ORC reader currently assumes it means no columns.
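
A small sketch of the convention described above (assumed semantics, not Hive's 
actual utility class): an empty hive.io.file.readcolumn.ids must mean every 
column, not none.

{code}
import java.util.ArrayList;
import java.util.List;

public class ReadColumnIds {
  // Returns null to signal "read all columns"; otherwise the included ids.
  static List<Integer> parseIncludedIds(String readColumnIds) {
    if (readColumnIds == null || readColumnIds.isEmpty()) {
      return null; // empty string: include every column
    }
    List<Integer> ids = new ArrayList<>();
    for (String id : readColumnIds.split(",")) {
      ids.add(Integer.parseInt(id.trim()));
    }
    return ids;
  }
}
{code}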



[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-03-01 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Status: Patch Available  (was: Open)

Pamela,
  Yeah, that probably makes sense. I'll file the follow-up JIRAs.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, 
> HIVE-3874.D8529.2.patch, HIVE-3874.D8529.3.patch, HIVE-3874.D8529.4.patch, 
> HIVE-3874.D8871.1.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing lightweight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows isn't stored in the file



[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-02-27 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588983#comment-13588983
 ] 

Owen O'Malley commented on HIVE-3874:
-

I'm actually tracking down a bug that Gunther found with a query. Let me finish 
tracking it down.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, 
> HIVE-3874.D8529.2.patch, HIVE-3874.D8529.3.patch, HIVE-3874.D8529.4.patch, 
> HIVE-3874.D8871.1.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing lightweight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows isn't stored in the file



[jira] [Resolved] (HIVE-4058) make ORC versioned

2013-02-25 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved HIVE-4058.
-

Resolution: Won't Fix

> make ORC versioned
> --
>
> Key: HIVE-4058
> URL: https://issues.apache.org/jira/browse/HIVE-4058
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Namit Jain
>




[jira] [Resolved] (HIVE-4061) skip columns which are not accessed in the query for ORC

2013-02-25 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved HIVE-4061.
-

Resolution: Cannot Reproduce

This is already done.

> skip columns which are not accessed in the query for ORC
> 
>
> Key: HIVE-4061
> URL: https://issues.apache.org/jira/browse/HIVE-4061
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Namit Jain
>




[jira] [Commented] (HIVE-4058) make ORC versioned

2013-02-25 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586398#comment-13586398
 ] 

Owen O'Malley commented on HIVE-4058:
-

I should also note that if it is required at some point, we can always create 
such a field in the footer and treat the missing field as version 0.
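
A toy model of that fallback (OptionalInt stands in for an optional protobuf 
field; not ORC's actual footer): a version field added later simply reads as 
absent on old files, and absence means version 0.

{code}
import java.util.OptionalInt;

public class FooterVersion {
  // A field added to the footer in a later release is simply unset in
  // files written before it existed; treat "unset" as version 0.
  static int versionOf(OptionalInt footerVersionField) {
    return footerVersionField.orElse(0);
  }

  public static void main(String[] args) {
    System.out.println(versionOf(OptionalInt.empty())); // old file -> 0
    System.out.println(versionOf(OptionalInt.of(2)));   // newer writer -> 2
  }
}
{code}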

> make ORC versioned
> --
>
> Key: HIVE-4058
> URL: https://issues.apache.org/jira/browse/HIVE-4058
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Namit Jain
>




[jira] [Assigned] (HIVE-4059) Make Column statistics for ORC optional

2013-02-25 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reassigned HIVE-4059:
---

Assignee: Owen O'Malley

> Make Column statistics for ORC optional
> ---
>
> Key: HIVE-4059
> URL: https://issues.apache.org/jira/browse/HIVE-4059
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Namit Jain
>Assignee: Owen O'Malley
>




[jira] [Commented] (HIVE-4058) make ORC versioned

2013-02-25 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586021#comment-13586021
 ] 

Owen O'Malley commented on HIVE-4058:
-

The metadata is versioned; it just doesn't have a global version. The intent is 
that new fields can be added to the protobuf and the reader will check whether 
those new fields are defined.

> make ORC versioned
> --
>
> Key: HIVE-4058
> URL: https://issues.apache.org/jira/browse/HIVE-4058
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Namit Jain
>




[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-02-25 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Status: Patch Available  (was: Open)

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, 
> HIVE-3874.D8529.2.patch, HIVE-3874.D8529.3.patch, HIVE-3874.D8529.4.patch, 
> HIVE-3874.D8871.1.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing lightweight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows isn't stored in the file



[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-02-25 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586018#comment-13586018
 ] 

Owen O'Malley commented on HIVE-3874:
-

Ok, I added some additional comments in the Writer as Namit asked, and all of 
the unit test cases pass.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, 
> HIVE-3874.D8529.2.patch, HIVE-3874.D8529.3.patch, HIVE-3874.D8529.4.patch, 
> HIVE-3874.D8871.1.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4000) Hive client goes into infinite loop at 100% cpu

2013-02-15 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13579591#comment-13579591
 ] 

Owen O'Malley commented on HIVE-4000:
-

The kind of query that is creating the problem looks like:

{code}
from Tbl
insert ...
insert ...
{code}

The customer sees the problem with 50 or more inserts.

> Hive client goes into infinite loop at 100% cpu
> ---
>
> Key: HIVE-4000
> URL: https://issues.apache.org/jira/browse/HIVE-4000
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 0.10.1
>
> Attachments: HIVE-4000.D8493.1.patch
>
>
> The Hive client starts multiple threads to track the progress of the 
> MapReduce jobs. Unfortunately those threads access several static HashMaps 
> that are not protected by locks. When the HashMaps are modified, they 
> sometimes cause race conditions that lead to the client threads getting stuck 
> in infinite loops.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-02-13 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Status: Patch Available  (was: Open)

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, 
> OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (HIVE-4015) Add ORC file to the grammar as a file format

2013-02-12 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reassigned HIVE-4015:
---

Assignee: Owen O'Malley

> Add ORC file to the grammar as a file format
> 
>
> Key: HIVE-4015
> URL: https://issues.apache.org/jira/browse/HIVE-4015
> Project: Hive
>  Issue Type: Improvement
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> It would be much more convenient for users if we enable them to use ORC as a 
> file format in the HQL grammar. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HIVE-4015) Add ORC file to the grammar as a file format

2013-02-12 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4015:
---

 Summary: Add ORC file to the grammar as a file format
 Key: HIVE-4015
 URL: https://issues.apache.org/jira/browse/HIVE-4015
 Project: Hive
  Issue Type: Improvement
Reporter: Owen O'Malley


It would be much more convenient for users if we enable them to use ORC as a 
file format in the HQL grammar. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-02-12 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576691#comment-13576691
 ] 

Owen O'Malley commented on HIVE-3874:
-

Kevin, I had some distractions at work, but I should get the patch uploaded 
today.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-4000) Hive client goes into infinite loop at 100% cpu

2013-02-11 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-4000:


Status: Patch Available  (was: Open)

Replace the sets/maps with concurrent versions that protect against concurrent 
access from multiple threads.
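
A minimal sketch of that kind of change (the class and field names are 
illustrative, not the actual patch):

{code}
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RunningJobRegistry {
  // Before: a plain static HashMap. Concurrent puts during a resize can
  // corrupt the bucket chains, leaving readers spinning forever in get().
  private static final Map<String, String> RUNNING_JOBS =
      new ConcurrentHashMap<String, String>();
  private static final Set<String> FINISHED =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

  public static void register(String jobId, String queryId) {
    RUNNING_JOBS.put(jobId, queryId);
  }

  public static void finish(String jobId) {
    RUNNING_JOBS.remove(jobId);
    FINISHED.add(jobId);
  }
}
{code}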

> Hive client goes into infinite loop at 100% cpu
> ---
>
> Key: HIVE-4000
> URL: https://issues.apache.org/jira/browse/HIVE-4000
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.9.0
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 0.10.1
>
> Attachments: HIVE-4000.D8493.1.patch
>
>
> The Hive client starts multiple threads to track the progress of the 
> MapReduce jobs. Unfortunately those threads access several static HashMaps 
> that are not protected by locks. When the HashMaps are modified, they 
> sometimes cause race conditions that lead to the client threads getting stuck 
> in infinite loops.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HIVE-4000) Hive client goes into infinite loop at 100% cpu

2013-02-07 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-4000:
---

 Summary: Hive client goes into infinite loop at 100% cpu
 Key: HIVE-4000
 URL: https://issues.apache.org/jira/browse/HIVE-4000
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.10.1


The Hive client starts multiple threads to track the progress of the MapReduce 
jobs. Unfortunately those threads access several static HashMaps that are not 
protected by locks. When the HashMaps are modified, they sometimes cause race 
conditions that lead to the client threads getting stuck in infinite loops.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-02-06 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572623#comment-13572623
 ] 

Owen O'Malley commented on HIVE-3874:
-

I've pushed the current version up to 
[github|http://github.com/hortonworks/orc] with the seek to record implemented. 
Does it make more sense to put ORC into serde or ql? RCFile is in ql, so I'd 
assumed it would go there. Thoughts?

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-02-05 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571510#comment-13571510
 ] 

Owen O'Malley commented on HIVE-3874:
-

[~kevinwilfong] Thanks for the bug fixes, Kevin. I pushed the DynamicByteArray 
and double serialization fixes to [github|https://github.com/hortonworks/orc]. 
I have the null column problem fixed, but it is tied into my other changes on 
my row-seek dev branch. I hope to finish up the row-seek today; then I'll merge 
it into master and make the patch that puts it into Hive.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-30 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566707#comment-13566707
 ] 

Owen O'Malley commented on HIVE-3874:
-

[~namit], I've got one more feature that I'm working on (seek to row) and then 
I'll make a patch. I'm aiming to upload the patch on Friday.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [VOTE] Amend Hive Bylaws + Add HCatalog Submodule

2013-01-28 Thread Owen O'Malley
+1 and +1


On Mon, Jan 28, 2013 at 1:56 PM, Ashish Thusoo  wrote:

> Measure 1: +1
> Measure 2: +1
>
> Ashish
>
>
> On Mon, Jan 28, 2013 at 1:11 PM, Ashutosh Chauhan wrote:
>
> > Measure 1: +1
> > Measure 2: +1
> >
> > Ashutosh
> >
> >
> > On Mon, Jan 28, 2013 at 11:48 AM, Carl Steinbach  wrote:
> >
> >> Measure 1: +1 (binding)
> >> Measure 2: +1 (binding)
> >>
> >> On Mon, Jan 28, 2013 at 11:47 AM, Carl Steinbach 
> wrote:
> >>
> >> > I am calling a vote on the following two measures.
> >> >
> >> > Measure 1: Amend Hive Bylaws to Define Submodules and Submodule
> >> Committers
> >> >
> >> > If this measure passes the Apache Hive Project Bylaws will be
> >> > amended with the following changes:
> >> >
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/Hive/Proposed+Changes+to+Hive+Bylaws+for+Submodule+Committers
> >> >
> >> > The motivation for these changes is discussed in the following
> >> > email thread which appeared on the hive-dev and hcatalog-dev
> >> > mailing lists:
> >> >
> >> > http://markmail.org/thread/u5nap7ghvyo7euqa
> >> >
> >> >
> >> > Measure 2: Create HCatalog Submodule and Adopt HCatalog Codebase
> >> >
> >> > This measure provides for 1) the establishment of an HCatalog
> >> > submodule in the Apache Hive Project, 2) the adoption of the
> >> > Apache HCatalog codebase into the Hive HCatalog submodule, and
> >> > 3) adding all currently active HCatalog committers as submodule
> >> > committers on the Hive HCatalog submodule.
> >> >
> >> > Passage of this measure depends on the passage of Measure 1.
> >> >
> >> >
> >> > Voting:
> >> >
> >> > Both measures require +1 votes from 2/3 of active Hive PMC
> >> > members in order to pass. All participants in the Hive project
> >> > are encouraged to vote on these measures, but only votes from
> >> > active Hive PMC members are binding. The voting period
> >> > commences immediately and shall last a minimum of six days.
> >> >
> >> > Voting is carried out by replying to this email thread. You must
> >> > indicate which measure you are voting on in order for your vote
> >> > to be counted.
> >> >
> >> > More details about the voting process can be found in the Apache
> >> > Hive Project Bylaws:
> >> >
> >> > https://cwiki.apache.org/confluence/display/Hive/Bylaws
> >> >
> >> >
> >>
> >
> >
>


[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-22 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Attachment: orc.tgz

I've fixed some bugs.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-22 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Attachment: (was: orc.tgz)

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-22 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Attachment: (was: orc.tgz)

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-18 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Attachment: orc.tgz

I've updated the patch with the index suppression option that Namit asked for.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx, orc.tgz, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-18 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Attachment: orc.tgz

Here's the current version of the code. The seek to row isn't implemented and 
it is still a standalone project, but it will let people start looking at it.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-18 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557353#comment-13557353
 ] 

Owen O'Malley commented on HIVE-3874:
-

Yin, large stripes (and I'm defaulting to 250MB) enable efficient reads from 
HDFS. The row indexes help address the cost of the large stripes by providing 
offsets within them.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-18 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557341#comment-13557341
 ] 

Owen O'Malley commented on HIVE-3874:
-

Joydeep, I've used a two level strategy:
  * large stripes (default 250MB) to enable large efficient reads
  * relatively frequent row index entries (default 10k rows) to enable 
skipping within a stripe

The row index entries have the locations within each column to enable seeking 
to the right compression block and byte within the decompressed block.
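
A rough sketch of what each row index entry has to record for a compressed 
column stream (the field names are illustrative, not the real ORC classes):

{code}
// One entry per row group (default 10k rows) per column stream.
public class RowIndexEntry {
  long compressedBlockStart;  // byte offset of the compression block in the stream
  int uncompressedOffset;     // byte offset within the decompressed block
  long rowsToSkip;            // rows to discard after seeking into the group

  // Seeking to an absolute row within a stripe starts by picking its row group.
  static int rowGroup(long rowInStripe, int rowIndexStride) {  // stride ~ 10000
    return (int) (rowInStripe / rowIndexStride);
  }
}
{code}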

I obviously did consider HFile, although from a practical point of view it is 
fairly embedded within HBase. Additionally, since it treats each of the columns 
as bytes it can't do any type-specific encodings/compression and can't 
interpret the column values, which is critical for performance.

Once you have the ability to skip large sets of rows based on the filter 
predicates, you can sort the table on the secondary keys and achieve a large 
speed up. For example, if your primary partition is transaction date, you might 
want to sort the table on state, zip, and last name. Then if you are looking 
for just the records in CA it won't need to read the records for the other 
states.




> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-11 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551286#comment-13551286
 ] 

Owen O'Malley commented on HIVE-3874:
-

Sambavi, I should have a patch ready next week. Yes, the row groups (stripes) 
are 250MB by default. I currently set the HDFS block size for the files to 2 
times the stripe size, but I don't try to align them other than that.
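
A sketch of that sizing policy under the assumption that the writer uses the 
standard FileSystem.create overload (this is not the actual writer code):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class OrcBlockSizing {
  static FSDataOutputStream openForWrite(FileSystem fs, Path path,
                                         long stripeSize) throws IOException {
    int bufferSize = fs.getConf().getInt("io.file.buffer.size", 4096);
    short replication = fs.getDefaultReplication();
    // Ask HDFS for a block size of 2x the stripe size, as described above.
    return fs.create(path, true, bufferSize, replication, 2 * stripeSize);
  }
}
{code}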

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-11 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551248#comment-13551248
 ] 

Owen O'Malley commented on HIVE-3874:
-

Doug, of course Trevni could be modified arbitrarily to match the needs of 
Hive. But Hive will benefit more if there is a deep integration between the 
file format and the query engine. Both HBase and Accumulo have file formats 
that were originally based on Hadoop's TFile. But the need for integration with 
the query engine was such that their projects were better served by having the 
file format in their project rather than an upstream project. 

Of course the Avro project is free to copy any of the ORC code into Trevni, but 
Hive has the need to innovate in this area without asking Avro to make changes 
and waiting for them to be released. 

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-11 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551231#comment-13551231
 ] 

Owen O'Malley commented on HIVE-3874:
-

Namit, I'm using the table properties to manage the other features like 
compression, so I would probably make a table property like 'orc.create.index' 
or something. Would that make sense?

I should note that the indexes are very light. In a sample file:

* uncompressed text: 370MB
* compressed orc: 86MB
* row index in orc: 140k
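
If the property went in, the writer-side check could be as simple as this 
sketch (the key name is just the suggestion above, not an existing setting):

{code}
import java.util.Properties;

public final class OrcWriterOptions {
  // Default to creating the index, since it is tiny relative to the data.
  static boolean createIndex(Properties tableProperties) {
    return Boolean.parseBoolean(
        tableProperties.getProperty("orc.create.index", "true"));
  }
}
{code}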

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3889) Add floating point compression to ORC file

2013-01-11 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3889:


Attachment: fpc-impl.tar

This is the file that Karol emailed to me so that I could submit it to Apache.

> Add floating point compression to ORC file
> --
>
> Key: HIVE-3889
> URL: https://issues.apache.org/jira/browse/HIVE-3889
> Project: Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: fpc-impl.tar
>
>
> Karol Wegrzycki, a CS student at University of Warsaw, has implemented an FPC 
> compressor for doubles. It would be great to hook this up to the ORC file 
> format so that we can get better compression for doubles.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HIVE-3889) Add floating point compression to ORC file

2013-01-11 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-3889:
---

 Summary: Add floating point compression to ORC file
 Key: HIVE-3889
 URL: https://issues.apache.org/jira/browse/HIVE-3889
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Karol Wegrzycki, a CS student at University of Warsaw, has implemented an FPC 
compressor for doubles. It would be great to hook this up to the ORC file 
format so that we can get better compression for doubles.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-10 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549839#comment-13549839
 ] 

Owen O'Malley commented on HIVE-3874:
-

Namit, for pure Hive users there aren't any advantages of Trevni over ORC.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-10 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549784#comment-13549784
 ] 

Owen O'Malley commented on HIVE-3874:
-

Namit, I obviously did consider Trevni, but it didn't support some of the 
features that I wanted:
* using the hive type model
* more advanced encodings like dictionaries
* the ability to support push down predicates for skipping row groups
* running compression in block mode rather than streaming so that the reader 
can skip entire compression blocks
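
To make the last point concrete, here is a minimal sketch of block-mode 
framing (the header layout is assumed for illustration, not the real ORC 
stream format): each chunk carries its compressed length, so a reader can 
skip a whole chunk without decompressing it.

{code}
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

public final class BlockCompressedStream {
  // Compress one chunk and prefix it with a 4-byte little-endian length header.
  static void writeChunk(ByteArrayOutputStream out, byte[] raw) {
    Deflater deflater = new Deflater();
    deflater.setInput(raw);
    deflater.finish();
    byte[] buf = new byte[Math.max(64, raw.length * 2)];
    int len = 0;
    while (!deflater.finished()) {
      len += deflater.deflate(buf, len, buf.length - len);
      if (len == buf.length) {
        buf = Arrays.copyOf(buf, buf.length * 2);  // grow if deflate expanded
      }
    }
    deflater.end();
    out.write(len & 0xff);
    out.write((len >>> 8) & 0xff);
    out.write((len >>> 16) & 0xff);
    out.write((len >>> 24) & 0xff);
    out.write(buf, 0, len);  // a reader can seek past len bytes to skip this chunk
  }
}
{code}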

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-10 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549774#comment-13549774
 ] 

Owen O'Malley commented on HIVE-3874:
-

He Yongqiang, the APIs to the two formats are significantly different. It would 
be possible to extend the RCFile reader to recognize an ORC file and to have it 
delegate to the ORC File reader.

The other direction (having the ORC file reader parse an RCFile) isn't 
possible, because ORC provides operations that would be very expensive or 
impossible to implement in RCFile.

One concern with making the RCFile reader delegate to the ORC file reader is 
that RCFile returns binary values that are interpreted by the serde, while in 
ORC deserialization happens in the reader. Therefore, the adaptor would either 
need to re-serialize the data or require changes in the serde as well.
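
A sketch of the delegation half of that idea (hypothetical; the magic-byte 
check is an assumption, not a committed API):

{code}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class FormatSniffer {
  private static final byte[] ORC_MAGIC = {'O', 'R', 'C'};  // assumed marker

  // An RCFile reader could call this and hand off to the ORC reader on true.
  static boolean looksLikeOrc(FileSystem fs, Path path) throws IOException {
    byte[] header = new byte[ORC_MAGIC.length];
    FSDataInputStream in = fs.open(path);
    try {
      in.readFully(0, header);
    } finally {
      in.close();
    }
    return Arrays.equals(header, ORC_MAGIC);
  }
}
{code}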

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-09 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3874:


Attachment: OrcFileIntro.pptx

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-09 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548750#comment-13548750
 ] 

Owen O'Malley commented on HIVE-3874:
-

Namit,
  Yes, it has dictionary encoding for strings. The dictionary enables better 
compression and makes push-down filters much more efficient. The dictionaries 
are local to each row group, so that row groups can be processed independently 
of each other. Currently, strings are always dictionary encoded, but it would 
make sense to allow the writer to pick whether a column should be encoded 
directly or using a dictionary.
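
A minimal sketch of per-row-group dictionary encoding (illustrative, not the 
real ORC writer): values are replaced by small integer ids, and a predicate 
can be answered against the dictionary once per row group.

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class StringDictionary {
  private final Map<String, Integer> ids = new HashMap<String, Integer>();
  private final List<String> values = new ArrayList<String>();

  // Called once per row; the returned id stream is what actually gets written.
  int add(String value) {
    Integer id = ids.get(value);
    if (id == null) {
      id = values.size();
      ids.put(value, id);
      values.add(value);
    }
    return id;
  }

  // A push-down filter like col = 'CA' needs only one lookup per row group.
  boolean mightContain(String needle) {
    return ids.containsKey(needle);
  }
}
{code}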

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-09 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-3874:
---

 Summary: Create a new Optimized Row Columnar file format for Hive
 Key: HIVE-3874
 URL: https://issues.apache.org/jira/browse/HIVE-3874
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


There are several limitations of the current RC File format that I'd like to 
address by creating a new format:
* each column value is stored as a binary blob, which means:
** the entire column value must be read, decompressed, and deserialized
** the file format can't use smarter type-specific compression
** push down filters can't be evaluated
* the start of each row group needs to be found by scanning
* user metadata can only be added to the file when the file is created
* the file doesn't store the number of rows per a file or row group
* there is no mechanism for seeking to a particular row number, which is 
required for external indexes.
* there is no mechanism for storing light weight indexes within the file to 
enable push-down filters to skip entire row groups.
* the type of the rows aren't stored in the file


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3234) getting the reporter in the recordwriter

2012-11-13 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3234:


Status: Patch Available  (was: Open)

This patch passes in the real mapreduce reporter as the progressable for 
getHiveRecordWriter. OutputFormats should still protect themselves from a null 
Progressable, but the FileSinkOperator now passes a Reporter from the 
mapreduce job.
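
The defensive pattern for OutputFormats is tiny; a sketch, assuming progress 
may still be null on some call paths:

{code}
import org.apache.hadoop.util.Progressable;

public final class ProgressUtil {
  // Keeps a long-running write/close from being killed for inactivity,
  // while tolerating callers that pass no Progressable at all.
  static void reportProgress(Progressable progress) {
    if (progress != null) {
      progress.progress();
    }
  }
}
{code}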

> getting the reporter in the recordwriter
> 
>
> Key: HIVE-3234
> URL: https://issues.apache.org/jira/browse/HIVE-3234
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 0.9.1
> Environment: any
>Reporter: Jimmy Hu
>Assignee: Owen O'Malley
>  Labels: newbie
> Fix For: 0.9.1
>
> Attachments: HIVE-3234.D6699.1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> We would like to generate some custom statistics and report them back to 
> map/reduce later when we implement the 
> FileSinkOperator.RecordWriter interface. However, the current interface 
> design doesn't allow us to get the map reduce reporter object. Please extend 
> the current FileSinkOperator.RecordWriter interface so that its close() 
> method passes in a map reduce reporter object. 
> For the same reason, please also extend the RecordReader interface to 
> include a reporter object so that users can pass in custom map reduce 
> counters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: hive 0.10 release

2012-11-08 Thread Owen O'Malley
+1


On Thu, Nov 8, 2012 at 3:18 PM, Carl Steinbach  wrote:

> +1
>
> On Wed, Nov 7, 2012 at 11:23 PM, Alexander Lorenz wrote:
>
> > +1, good karma
> >
> > On Nov 8, 2012, at 4:58 AM, Namit Jain  wrote:
> >
> > > +1 to the idea
> > >
> > > On 11/8/12 6:33 AM, "Edward Capriolo"  wrote:
> > >
> > >> That sounds good. I think this issue needs to be solved as well as
> > >> anything else that produces a bugus query result.
> > >>
> > >> https://issues.apache.org/jira/browse/HIVE-3083
> > >>
> > >> Edward
> > >>
> > >> On Wed, Nov 7, 2012 at 7:50 PM, Ashutosh Chauhan <
> hashut...@apache.org>
> > >> wrote:
> > >>> Hi,
> > >>>
> > >>> Its been a while since we released 0.10 more than six months ago. All
> > >>> this
> > >>> while, lot of action has happened with various cool features landing
> in
> > >>> trunk. Additionally, I am looking forward to HiveServer2 landing in
> > >>> trunk.  So, I propose that we cut the branch for 0.10 soon afterwards
> > >>> and
> > >>> than release it. Thoughts?
> > >>>
> > >>> Thanks,
> > >>> Ashutosh
> > >
> >
> > --
> > Alexander Alten-Lorenz
> > http://mapredit.blogspot.com
> > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> >
> >
>


[jira] [Created] (HIVE-3660) Improve OutputFormat for Hive

2012-11-02 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-3660:
---

 Summary: Improve OutputFormat for Hive
 Key: HIVE-3660
 URL: https://issues.apache.org/jira/browse/HIVE-3660
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Hive's output formats are currently given a list of binary blobs to store, 
which severely limits the options for file formats. I'd like to create a new 
OutputFormat interface that provides:
* table properties
* object inspector for the row
* type info for the row

The RecordWriter would be passed the internal row object.
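
A sketch of the proposed shape (the names are illustrative, not a committed 
Hive API; Object stands in for Hive's ObjectInspector and TypeInfo to keep 
the sketch self-contained):

{code}
import java.io.IOException;
import java.util.Properties;

interface RowOutputFormat {
  RowWriter getRowWriter(Properties tableProperties,
                         Object rowObjectInspector,  // ObjectInspector in Hive
                         Object rowTypeInfo)         // TypeInfo in Hive
      throws IOException;
}

interface RowWriter {
  void write(Object row) throws IOException;  // Hive's internal row object
  void close() throws IOException;
}
{code}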

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3599) missing return of compression codec to pool

2012-10-18 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3599:


Status: Patch Available  (was: Open)

> missing return of compression codec to pool
> ---
>
> Key: HIVE-3599
> URL: https://issues.apache.org/jira/browse/HIVE-3599
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive-3599.patch
>
>
> The RCFile writer is currently missing a call to return one of the 
> compression codecs to the pool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-3599) missing return of compression codec to pool

2012-10-18 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3599:


Attachment: hive-3599.patch

Here's the obvious fix. There is no functional difference.
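
For context, the pattern the fix restores looks like this sketch (not the 
actual diff): every compressor borrowed from CodecPool must be returned, 
typically in a finally block.

{code}
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;

public final class CodecPoolUsage {
  static void writeCompressed(CompressionCodec codec, OutputStream rawOut,
                              byte[] data) throws IOException {
    Compressor compressor = CodecPool.getCompressor(codec);
    try {
      CompressionOutputStream out = codec.createOutputStream(rawOut, compressor);
      out.write(data);
      out.finish();
    } finally {
      CodecPool.returnCompressor(compressor);  // the call the writer was missing
    }
  }
}
{code}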

> missing return of compression codec to pool
> ---
>
> Key: HIVE-3599
> URL: https://issues.apache.org/jira/browse/HIVE-3599
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive-3599.patch
>
>
> The RCFile writer is currently missing a call to return one of the 
> compression codecs to the pool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HIVE-3599) missing return of compression codec to pool

2012-10-18 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-3599:
---

 Summary: missing return of compression codec to pool
 Key: HIVE-3599
 URL: https://issues.apache.org/jira/browse/HIVE-3599
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: Owen O'Malley


The RCFile writer is currently missing a call to return one of the 
compression codecs to the pool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (HIVE-3599) missing return of compression codec to pool

2012-10-18 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reassigned HIVE-3599:
---

Assignee: Owen O'Malley

> missing return of compression codec to pool
> ---
>
> Key: HIVE-3599
> URL: https://issues.apache.org/jira/browse/HIVE-3599
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>
> The RCFile writer is currently missing a call to return one of the 
> compression codecs to the pool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: non map-reduce for simple queries

2012-07-31 Thread Owen O'Malley
On Mon, Jul 30, 2012 at 11:38 PM, Namit Jain  wrote:

> That would be difficult. The % done can be estimated from the data already
> read.
>

I'm confused. Wouldn't the size of the data remaining, divided by the size of
the original input, give a reasonable approximation of the fraction of work
remaining, and therefore of the work done?
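
Spelled out (variable names are illustrative):

  // Hypothetical byte-based estimate: with 100 MB of input and 20 MB still
  // to read, the query is roughly 80% done.
  double done = 1.0 - (double) bytesRemaining / (double) totalInputBytes;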


>
> It might be simpler to have a check like: if the query isn't done in
> the first 5 seconds of running locally, you switch to mapreduce.
>

There are three problems I see:
  * If the query is 95% done at 5 seconds, it is a shame to kill it and
start over again at 0% on mapreduce with a much longer latency. (Instead of
spending the additional 0.25 seconds you spend an additional 60+.)
  * You can't print anything until you know whether you are going to kill
it or not. (The mapreduce results might come back in a different order.)
With user-facing programs, it is much better to start printing early
instead of later since it gives faster feedback to the user.
  * It isn't predictable how the query will run. That makes it very hard to
build applications on top of Hive.

Do those make sense?


Re: non map-reduce for simple queries

2012-07-30 Thread Owen O'Malley
On Mon, Jul 30, 2012 at 9:12 PM, Namit Jain  wrote:

> The total number of bytes of the input will be used to determine whether
> to not launch a map-reduce job for this
> query. That was in my original mail.
>
> However, given any complex where condition and the lack of column
> statistics in hive, we cannot determine the
> number of bytes that would be needed to satisfy the where condition.


All of these heuristics are guidelines, clearly. My inclination would
be to use the maximum data volume as the primary metric until we have a
better understanding of cases where that doesn't work well. If we are going
to try the local solution and fall back to mapreduce, it seems better to
put a limit well short of being done so that you don't waste as much work.
Perhaps, if the query isn't 10% done in the first 5 seconds of running
locally, you switch to mapreduce. Would that work?
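
In rough pseudocode (the local-fetch and progress calls are hypothetical
names, not existing Hive methods):

  long start = System.currentTimeMillis();
  while (localFetch.hasNext()) {
    buffer(localFetch.next());
    if (System.currentTimeMillis() - start > 5000
        && estimatedProgress() < 0.10) {
      localFetch.abort();     // bail out early so little work is wasted
      submitMapReduceJob();   // fall back to the normal execution path
      break;
    }
  }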

-- Owen


Re: non map-reduce for simple queries

2012-07-30 Thread Owen O'Malley
On Sat, Jul 28, 2012 at 6:17 PM, Navis류승우  wrote:

> I was thinking of a timeout for fetching, 2000 msec for example. How about
> that?
>

Instead of time, which requires launching the query and letting it time out,
how about determining the number of bytes that would need to be fetched to
the local box? Limiting it to 100 or 200 MB seems reasonable.
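
As a sketch, the guard could be as simple as this (the config property and
helper names here are made up for illustration):

  // Hypothetical threshold check before choosing local fetch over MR.
  long maxLocalBytes = conf.getLong("hive.fetch.max.bytes", 200L * 1024 * 1024);
  if (estimatedInputBytes(query) <= maxLocalBytes) {  // hypothetical estimator
    runLocally(query);
  } else {
    runAsMapReduce(query);
  }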

-- Owen


[jira] [Commented] (HIVE-3153) Release codecs and output streams between flushes of RCFile

2012-07-26 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423464#comment-13423464
 ] 

Owen O'Malley commented on HIVE-3153:
-

I also wrote a test program that just writes to a large number of 
RCFile.Writers. With the patch, I was able to use a lot more Writers before I 
ran out of memory in the process.
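
The test was along these lines (a simplified sketch of the idea, not the
actual test program; paths and column count are arbitrary):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hive.ql.io.RCFile;
  import org.apache.hadoop.hive.ql.io.RCFileOutputFormat;

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.getLocal(conf);
  RCFileOutputFormat.setColumnNumber(conf, 20);
  List<RCFile.Writer> writers = new ArrayList<RCFile.Writer>();
  try {
    // Keep opening writers until the heap is exhausted.
    for (int i = 0; ; ++i) {
      writers.add(new RCFile.Writer(fs, conf, new Path("/tmp/rc-test/" + i)));
    }
  } catch (OutOfMemoryError oom) {
    System.err.println("ran out of memory after " + writers.size() + " writers");
  }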

> Release codecs and output streams between flushes of RCFile
> ---
>
> Key: HIVE-3153
> URL: https://issues.apache.org/jira/browse/HIVE-3153
> Project: Hive
>  Issue Type: Improvement
>  Components: Compression
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive-3153.patch
>
>
> Currently, the RCFile writer holds a compression codec per file and a 
> compression output stream per column. Especially for queries that use 
> dynamic partitions, this quickly consumes a lot of memory.
> I'd like flushRecords to get a codec from the pool and create the compression 
> output stream in flushRecords.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-3153) Release codecs and output streams between flushes of RCFile

2012-07-26 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423451#comment-13423451
 ] 

Owen O'Malley commented on HIVE-3153:
-

The use case that this helps is the one with a relatively large number (~2000) 
of dynamic partitions per reducer. In that case the task has an open 
RCFile.Writer per dynamic partition, but they aren't being flushed in parallel. 
By moving the extra buffers and compression codecs so that they are acquired 
only when they are needed for a flush, instead of held for the whole lifespan 
of the Writer, I'm able to keep a lot more Writers open at once.
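
The shape of the change, in simplified form (flushRecords is the real method;
the helper and field names shown are illustrative):

  void flushRecords() throws IOException {
    // Acquire the compressor only for the duration of the flush ...
    Compressor compressor = CodecPool.getCompressor(codec);
    try {
      for (ColumnBuffer column : columnBuffers) {
        // ... and create the compression output stream here, per flush,
        // so the Writer holds no per-column stream between flushes.
        writeCompressedColumn(column, compressor);
      }
    } finally {
      CodecPool.returnCompressor(compressor);  // back to the pool immediately
    }
  }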

> Release codecs and output streams between flushes of RCFile
> ---
>
> Key: HIVE-3153
> URL: https://issues.apache.org/jira/browse/HIVE-3153
> Project: Hive
>  Issue Type: Improvement
>  Components: Compression
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive-3153.patch
>
>
> Currently, the RCFile writer holds a compression codec per file and a 
> compression output stream per column. Especially for queries that use 
> dynamic partitions, this quickly consumes a lot of memory.
> I'd like flushRecords to get a codec from the pool and create the compression 
> output stream in flushRecords.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-3153) Release codecs and output streams between flushes of RCFile

2012-07-23 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421023#comment-13421023
 ] 

Owen O'Malley commented on HIVE-3153:
-

I just posted this as https://reviews.facebook.net/D4299 .

> Release codecs and output streams between flushes of RCFile
> ---
>
> Key: HIVE-3153
> URL: https://issues.apache.org/jira/browse/HIVE-3153
> Project: Hive
>  Issue Type: Improvement
>  Components: Compression
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive-3153.patch
>
>
> Currently, the RCFile writer holds a compression codec per file and a 
> compression output stream per column. Especially for queries that use 
> dynamic partitions, this quickly consumes a lot of memory.
> I'd like flushRecords to get a codec from the pool and create the compression 
> output stream in flushRecords.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HIVE-3234) getting the reporter in the recordwriter

2012-07-05 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reassigned HIVE-3234:
---

Assignee: Owen O'Malley

> getting the reporter in the recordwriter
> 
>
> Key: HIVE-3234
> URL: https://issues.apache.org/jira/browse/HIVE-3234
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 0.9.1
> Environment: any
>    Reporter: Jimmy Hu
>Assignee: Owen O'Malley
>  Labels: newbie
> Fix For: 0.9.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> We would like to generate some custom statistics and report them back to 
> map/reduce when we implement the FileSinkOperator.RecordWriter interface. 
> However, the current interface design doesn't allow us to get the map/reduce 
> reporter object. Please extend the FileSinkOperator.RecordWriter interface 
> so that its close() method passes in a map/reduce reporter object. 
> For the same reason, please also extend the RecordReader interface to 
> include a reporter object, so that users can pass in custom map/reduce 
> counters.
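
One possible shape for the requested change (illustrative, not a committed
interface):

  import java.io.IOException;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.Reporter;

  public interface RecordWriter {
    void write(Writable row) throws IOException;
    // Today close() takes only the abort flag; the request is to also
    // pass the Reporter so writers can emit custom counters.
    void close(boolean abort, Reporter reporter) throws IOException;
  }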

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-3098) Memory leak from large number of FileSystem instances in FileSystem.CACHE. (Must cache UGIs.)

2012-06-28 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403579#comment-13403579
 ] 

Owen O'Malley commented on HIVE-3098:
-

Alejandro,
  Daryn is absolutely right that we can't make the Subjects immutable. We need 
to be able to update a Subject with updated Kerberos tickets and tokens, and 
changing that would break a lot of other code.

It would probably make sense to make a UGI.doAsAndCleanup that does a doAs and 
then removes all filesystems based on the ugi, since clearly most of the Hadoop 
ecosystem servers have related problems.
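
A rough sketch of that helper, assuming Hadoop's FileSystem.closeAllForUGI
API:

  import java.io.IOException;
  import java.security.PrivilegedExceptionAction;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.security.UserGroupInformation;

  public static <T> T doAsAndCleanup(UserGroupInformation ugi,
                                     PrivilegedExceptionAction<T> action)
      throws IOException, InterruptedException {
    try {
      return ugi.doAs(action);
    } finally {
      FileSystem.closeAllForUGI(ugi);  // evict this UGI's cached FileSystems
    }
  }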

> Memory leak from large number of FileSystem instances in FileSystem.CACHE. 
> (Must cache UGIs.)
> -
>
> Key: HIVE-3098
> URL: https://issues.apache.org/jira/browse/HIVE-3098
> Project: Hive
>  Issue Type: Bug
>  Components: Shims
>Affects Versions: 0.9.0
> Environment: Running with Hadoop 20.205.0.3+ / 1.0.x with security 
> turned on.
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-3098.patch
>
>
> The problem manifested from stress-testing HCatalog 0.4.1 (as part of testing 
> the Oracle backend).
> The HCatalog server ran out of memory (-Xmx2048m) when pounded by 60 threads, 
> in under 24 hours. The heap-dump indicates that hadoop::FileSystem.CACHE had 
> 100 instances of FileSystem, whose combined retained memory consumed the 
> entire heap.
> It boiled down to hadoop::UserGroupInformation::equals() being implemented 
> such that the "Subject" member is compared for identity ("=="), and not 
> equivalence (".equals()"). This causes equivalent UGI instances to compare as 
> unequal, and causes a new FileSystem instance to be created and cached.
> The UGI.equals() is so implemented, incidentally, as a fix for yet another 
> problem (HADOOP-6670); so it is unlikely that that implementation can be 
> modified.
> The solution for this is to check for UGI equivalence in HCatalog (i.e. in 
> the Hive metastore), using a cache for UGI instances in the shims.
> I have a patch to fix this. I'll upload it shortly. I just ran an overnight 
> test to confirm that the memory-leak has been arrested.
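
A minimal sketch of that kind of cache (the key choice and method names are
illustrative; the actual patch lives in the shims):

  import java.util.concurrent.ConcurrentHashMap;
  import org.apache.hadoop.security.UserGroupInformation;

  private static final ConcurrentHashMap<String, UserGroupInformation> UGI_CACHE =
      new ConcurrentHashMap<String, UserGroupInformation>();

  static UserGroupInformation getCachedUgi(String userName) {
    UserGroupInformation ugi = UGI_CACHE.get(userName);
    if (ugi == null) {
      ugi = UserGroupInformation.createRemoteUser(userName);
      UserGroupInformation existing = UGI_CACHE.putIfAbsent(userName, ugi);
      if (existing != null) {
        ugi = existing;  // another thread raced us; reuse its UGI
      }
    }
    // Equivalent requests now share one UGI, so FileSystem.CACHE sees one key.
    return ugi;
  }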

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HIVE-3153) Release codecs and output streams between flushes of RCFile

2012-06-25 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3153:


Attachment: hive-3153.patch

This patch:
 * Fixes some javadoc.
 * Suppresses some unused warnings.
 * Deprecates some of the unused public functions that don't seem to be 
important parts of the API.
 * Reduces the memory footprint of the Writer to just the array of 
ColumnBuffers.

With this patch, I'm able to use many more parallel writers in the same 
memory footprint.

> Release codecs and output streams between flushes of RCFile
> ---
>
> Key: HIVE-3153
> URL: https://issues.apache.org/jira/browse/HIVE-3153
> Project: Hive
>  Issue Type: Improvement
>  Components: Compression
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive-3153.patch
>
>
> Currently, the RCFile writer holds a compression codec per file and a 
> compression output stream per column. Especially for queries that use 
> dynamic partitions, this quickly consumes a lot of memory.
> I'd like flushRecords to get a codec from the pool and create the compression 
> output stream in flushRecords.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HIVE-3153) Release codecs and output streams between flushes of RCFile

2012-06-25 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-3153:


Status: Patch Available  (was: Open)

> Release codecs and output streams between flushes of RCFile
> ---
>
> Key: HIVE-3153
> URL: https://issues.apache.org/jira/browse/HIVE-3153
> Project: Hive
>  Issue Type: Improvement
>  Components: Compression
>    Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive-3153.patch
>
>
> Currently, the RCFile writer holds a compression codec per file and a 
> compression output stream per column. Especially for queries that use 
> dynamic partitions, this quickly consumes a lot of memory.
> I'd like flushRecords to get a codec from the pool and create the compression 
> output stream in flushRecords.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



