[jira] [Created] (HIVE-25400) Move the offset updating in BytesColumnVector to setValPreallocated.
Owen O'Malley created HIVE-25400: Summary: Move the offset updating in BytesColumnVector to setValPreallocated. Key: HIVE-25400 URL: https://issues.apache.org/jira/browse/HIVE-25400 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: storage-2.7.3, storage-2.8.1, storage-2.9.0 HIVE-25190 changed the semantics of BytesColumnVector so that ensureValPreallocated reserved the room, which interacted badly with ORC's redact mask code. The redact mask code needs to increase the allocation as it goes, so it calls ensureValPreallocated multiple times. -- This message was sent by Atlassian Jira (v8.3.4#803005)
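For illustration, a minimal sketch of the split the summary asks for. `TinyBytesVector` is a hypothetical, simplified stand-in and not the storage-api class; the assumption is that ensureValPreallocated only guarantees capacity, so it can be called repeatedly, while setValPreallocated is the single place that advances the offset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical, simplified stand-in for a bytes column vector; not the storage-api class.
final class TinyBytesVector {
  private byte[] buffer = new byte[16];
  private int nextFree = 0;            // offset of the first unused byte
  final int[] start = new int[4];
  final int[] length = new int[4];

  // Guarantee room for len bytes at nextFree. This does NOT move nextFree,
  // so a caller may ask again with a larger size before committing.
  void ensureValPreallocated(int len) {
    long needed = (long) nextFree + len;
    if (needed > Integer.MAX_VALUE - 8) {
      throw new IllegalStateException("Reservation too large: " + needed);
    }
    if (needed > buffer.length) {
      int grown = (int) Math.min(Math.max(needed, 2L * buffer.length), Integer.MAX_VALUE - 8);
      buffer = Arrays.copyOf(buffer, grown);
    }
  }

  byte[] getValPreallocatedBytes() { return buffer; }

  int getValPreallocatedStart() { return nextFree; }

  // Commit len bytes as the value of row and only now advance the offset.
  void setValPreallocated(int row, int len) {
    start[row] = nextFree;
    length[row] = len;
    nextFree += len;
  }

  public static void main(String[] args) {
    TinyBytesVector v = new TinyBytesVector();
    v.ensureValPreallocated(4);
    v.ensureValPreallocated(40);       // grow the reservation; the offset is unchanged
    byte[] data = "redacted".getBytes(StandardCharsets.UTF_8);
    System.arraycopy(data, 0, v.getValPreallocatedBytes(), v.getValPreallocatedStart(), data.length);
    v.setValPreallocated(0, data.length);
    System.out.println(v.start[0] + " " + v.length[0]);
  }
}
```

With this shape, a caller that discovers it needs more room halfway through writing a value can simply ask again without leaving a gap in the buffer.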
[jira] [Created] (HIVE-25190) BytesColumnVector fails when the aggregate size is > 1gb
Owen O'Malley created HIVE-25190: Summary: BytesColumnVector fails when the aggregate size is > 1gb Key: HIVE-25190 URL: https://issues.apache.org/jira/browse/HIVE-25190 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, BytesColumnVector will allocate a buffer for small values (< 1mb), but fail with: {code:java} new RuntimeException("Overflow of newLength. smallBuffer.length=" + smallBuffer.length + ", nextElemLength=" + nextElemLength); {code} if the aggregate size of the buffer crosses over 1gb. -- This message was sent by Atlassian Jira (v8.3.4#803005)
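A small sketch of why growing a buffer by doubling breaks at the 1gb mark. `BufferGrowth` and its methods are hypothetical and are not the storage-api code or the eventual fix; they only show the int overflow and one long-based way to avoid it.

```java
// Hypothetical helper, not the storage-api code: shows why doubling an int
// length overflows once the buffer reaches 1 GiB, and a long-based alternative.
final class BufferGrowth {
  static final int MAX_ARRAY = Integer.MAX_VALUE - 8; // conservative JVM array limit

  // Naive doubling: 2^30 * 2 wraps around to a negative int.
  static int naiveDouble(int currentLength) {
    return currentLength * 2;
  }

  // Do the arithmetic in long, then clamp to the largest usable array size.
  static int safeCapacity(int currentLength, int nextElemLength) {
    long needed = (long) currentLength + nextElemLength;
    if (needed > MAX_ARRAY) {
      throw new IllegalStateException("Value does not fit in a single buffer: " + needed);
    }
    long doubled = 2L * currentLength;
    return (int) Math.min(Math.max(doubled, needed), MAX_ARRAY);
  }

  public static void main(String[] args) {
    int oneGiB = 1 << 30;
    System.out.println(naiveDouble(oneGiB));        // negative because of int overflow
    System.out.println(safeCapacity(oneGiB, 1024)); // clamped to MAX_ARRAY
  }
}
```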
[jira] [Created] (HIVE-24458) Allow access to SArgs without converting to disjunctive normal form
Owen O'Malley created HIVE-24458: Summary: Allow access to SArgs without converting to disjunctive normal form Key: HIVE-24458 URL: https://issues.apache.org/jira/browse/HIVE-24458 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Owen O'Malley For some use cases, it is useful to have access to the SArg expression in a non-normalized form. Currently, the SArg only provides the fully normalized expression. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-24455) Fix broken junit framework in storage-api
Owen O'Malley created HIVE-24455: Summary: Fix broken junit framework in storage-api Key: HIVE-24455 URL: https://issues.apache.org/jira/browse/HIVE-24455 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley The use of junit is broken in storage-api. It results in no tests being found. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-23215) Make FilterContext and MutableFilterContext interfaces
Owen O'Malley created HIVE-23215: Summary: Make FilterContext and MutableFilterContext interfaces Key: HIVE-23215 URL: https://issues.apache.org/jira/browse/HIVE-23215 Project: Hive Issue Type: Bug Components: storage-api Reporter: Owen O'Malley Assignee: Owen O'Malley HIVE-22959 introduced FilterContext to support ORC-577. The duplication of fields between the FilterContext and VectorizedRowBatch seems likely to cause user confusion. This patch makes them interfaces that VectorizedRowBatch implements. Thus, there is a single copy of the data and no need to copy them back and forth. LLAP can make its own implementation of the interfaces if it doesn't want to use VectorizedRowBatch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
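To make the "single copy of the data" point concrete, here is a simplified sketch. The interface and method names below are illustrative assumptions; the real storage-api signatures may differ.

```java
// Simplified, hypothetical shapes; the real storage-api signatures may differ.
interface FilterContextSketch {
  boolean isSelectedInUse();
  int[] getSelected();
  int getSelectedSize();
}

interface MutableFilterContextSketch extends FilterContextSketch {
  void setFilterContext(boolean selectedInUse, int[] selected, int selectedSize);
}

// The batch owns the only copy of the selection state and exposes it through
// both interfaces, so nothing has to be copied back and forth.
final class RowBatchSketch implements MutableFilterContextSketch {
  private boolean selectedInUse;
  private int[] selected = new int[0];
  private int size;

  @Override public boolean isSelectedInUse() { return selectedInUse; }
  @Override public int[] getSelected() { return selected; }
  @Override public int getSelectedSize() { return size; }

  @Override
  public void setFilterContext(boolean selectedInUse, int[] selected, int selectedSize) {
    this.selectedInUse = selectedInUse;
    this.selected = selected;
    this.size = selectedSize;
  }
}
```

A filter would write through the mutable view and a reader would consume the read-only view of the same object; another engine could supply its own implementation of the two interfaces instead.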
[jira] [Created] (HIVE-22405) Add ColumnVector support for ProlepticCalendar
Owen O'Malley created HIVE-22405: Summary: Add ColumnVector support for ProlepticCalendar Key: HIVE-22405 URL: https://issues.apache.org/jira/browse/HIVE-22405 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Hive recently moved its processing to the proleptic calendar, which has created some issues for users who have dates before 1580 AD. I'd propose extending the column vectors for times & dates to encode which calendar they are using.
* create DateColumnVector that extends LongColumnVector
* add a method to change calendars to both DateColumnVector and TimestampColumnVector.
{code}
/**
 * Change the calendar to or from proleptic. If the new and old values of the flag are the
 * same, nothing is done.
 * useProleptic - set the flag for the proleptic calendar
 * updateData - change the data to match the new value of the flag.
 */
void changeCalendar(boolean useProleptic, boolean updateData);

/**
 * Detect whether this data is using the proleptic calendar.
 */
boolean usingProlepticCalendar();
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
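As a rough illustration of why old dates are affected, the sketch below compares the day number of the same year/month/day under the hybrid Julian/Gregorian calendar (the default of java.util.GregorianCalendar) and the proleptic Gregorian calendar (java.time). The class `CalendarShift` is hypothetical and uses only JDK classes; it is not the proposed ColumnVector code.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

final class CalendarShift {
  private static final long MILLIS_PER_DAY = 86_400_000L;

  // Days since 1970-01-01 when the fields are read in the hybrid
  // Julian/Gregorian calendar (java.util.GregorianCalendar's default).
  static long hybridEpochDay(int year, int month, int day) {
    GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
    cal.clear();
    cal.set(year, month - 1, day);   // Calendar months are zero-based
    return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
  }

  // Days since 1970-01-01 for the same fields in the proleptic Gregorian calendar.
  static long prolepticEpochDay(int year, int month, int day) {
    return LocalDate.of(year, month, day).toEpochDay();
  }

  public static void main(String[] args) {
    long oldShift = hybridEpochDay(1500, 1, 1) - prolepticEpochDay(1500, 1, 1);
    long newShift = hybridEpochDay(2000, 1, 1) - prolepticEpochDay(2000, 1, 1);
    System.out.println("1500-01-01 shift in days: " + oldShift);
    System.out.println("2000-01-01 shift in days: " + newShift);
  }
}
```

A changeCalendar(useProleptic, true) call would need to apply this kind of per-value shift to the stored day numbers, while changeCalendar(useProleptic, false) would only flip the flag.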
[jira] [Created] (HIVE-22105) Update ORC to 1.5.6.
Owen O'Malley created HIVE-22105: Summary: Update ORC to 1.5.6. Key: HIVE-22105 URL: https://issues.apache.org/jira/browse/HIVE-22105 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley ORC has had some important fixes in the 1.5 branch and they should be picked up by Hive. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (HIVE-21585) Upgrade branch-2.3 to ORC 1.3.4
Owen O'Malley created HIVE-21585: Summary: Upgrade branch-2.3 to ORC 1.3.4 Key: HIVE-21585 URL: https://issues.apache.org/jira/browse/HIVE-21585 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Hive's branch-2.3 currently uses ORC 1.3.3. I'd like to upgrade it to use the bug fix release [ORC 1.3.4|https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-printable/temp/SearchRequest.html?jqlQuery=project+%3D+ORC+AND+status+%3D+Closed+AND+fixVersion+%3D+%221.3.4%22=500]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20135) Fix incompatible change in TimestampColumnVector to default to UTC
Owen O'Malley created HIVE-20135: Summary: Fix incompatible change in TimestampColumnVector to default to UTC Key: HIVE-20135 URL: https://issues.apache.org/jira/browse/HIVE-20135 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Jesus Camacho Rodriguez HIVE-20007 changed the default for TimestampColumnVector to use UTC, which breaks API compatibility with storage-api 2.6. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19013) Fix some minor build issues in storage-api
Owen O'Malley created HIVE-19013: Summary: Fix some minor build issues in storage-api Key: HIVE-19013 URL: https://issues.apache.org/jira/browse/HIVE-19013 Project: Hive Issue Type: Bug Components: storage-api Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, the storage-api tests complain that there isn't a log4j2.xml and the javadoc fails. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-17925) Fix TestHooks so that it avoids ClassNotFound on teardown
Owen O'Malley created HIVE-17925: Summary: Fix TestHooks so that it avoids ClassNotFound on teardown Key: HIVE-17925 URL: https://issues.apache.org/jira/browse/HIVE-17925 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley TestHooks gets a ClassNotFound exception during teardown, which messes up some following tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17924) Restore SerDe by reverting HIVE-15167 to unbreak API compatibility
Owen O'Malley created HIVE-17924: Summary: Restore SerDe by reverting HIVE-15167 to unbreak API compatibility Key: HIVE-17924 URL: https://issues.apache.org/jira/browse/HIVE-17924 Project: Hive Issue Type: Bug Affects Versions: 2.3.0, 2.3.1 Reporter: Owen O'Malley Assignee: Owen O'Malley HIVE-15167 broke compatibility badly for very little gain and caused a lot of pain for our users. We should revert it and restore the SerDe interface. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17173) Add some convenience redirects to the Hive site
Owen O'Malley created HIVE-17173: Summary: Add some convenience redirects to the Hive site Key: HIVE-17173 URL: https://issues.apache.org/jira/browse/HIVE-17173 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Owen O'Malley I'd propose that we add the following redirects to our site's .htaccess:
* http://hive.apache.org/bugs -> https://issues.apache.org/jira/browse/hive
* http://hive.apache.org/downloads -> https://www.apache.org/dyn/closer.cgi/hive/
* http://hive.apache.org/releases -> https://hive.apache.org/docs/downloads.html
* http://hive.apache.org/src -> https://github.com/apache/hive
* http://hive.apache.org/web-src -> https://svn.apache.org/repos/asf/hive/cms/trunk
Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17171) Remove old javadoc versions
Owen O'Malley created HIVE-17171: Summary: Remove old javadoc versions Key: HIVE-17171 URL: https://issues.apache.org/jira/browse/HIVE-17171 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley We currently have a lot of old javadoc versions. I'd propose that we keep the following versions:
* r1.2.2
* r2.1.1
* r2.2.0
(Note that 2.3.0 was not checked in to the site.) In particular, I'd suggest we remove:
* hcat-r0.5.0
* r0.10.0
* r0.11.0
* r0.12.0
* r0.13.1
* r1.0.1
* r1.1.1
* r2.0.1
Any concerns? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17154) fix rat problems in branch-2.2
Owen O'Malley created HIVE-17154: Summary: fix rat problems in branch-2.2 Key: HIVE-17154 URL: https://issues.apache.org/jira/browse/HIVE-17154 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Fix rat problems in the branch-2.2. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17118) Clean up of HIVE-14309 to move the orc source code to org.apache.hive.orc
Owen O'Malley created HIVE-17118: Summary: Clean up of HIVE-14309 to move the orc source code to org.apache.hive.orc Key: HIVE-17118 URL: https://issues.apache.org/jira/browse/HIVE-17118 Project: Hive Issue Type: Bug Components: ORC Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.2.0 Just for branch-2.2. HIVE-14309 shaded the hive-orc jar to use a unique package, org.apache.hive.orc. This patch moves the source files over to the right directory and removes the shading. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-16787) Fix itests in branch-2.2
Owen O'Malley created HIVE-16787: Summary: Fix itests in branch-2.2 Key: HIVE-16787 URL: https://issues.apache.org/jira/browse/HIVE-16787 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.2.0 The itests are broken in branch 2.2 and need to be fixed before release. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16683) ORC WriterVersion gets ArrayIndexOutOfBoundsException on newer ORC files
Owen O'Malley created HIVE-16683: Summary: ORC WriterVersion gets ArrayIndexOutOfBoundsException on newer ORC files Key: HIVE-16683 URL: https://issues.apache.org/jira/browse/HIVE-16683 Project: Hive Issue Type: Bug Components: ORC Affects Versions: 2.1.1, 2.2.0 Reporter: Owen O'Malley Assignee: Owen O'Malley This only impacts branch-2.1 and branch-2.2, because it has been fixed in the ORC project's code base via ORC-125. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16549) Fix an incompatible change in PredicateLeafImpl from HIVE-15269
Owen O'Malley created HIVE-16549: Summary: Fix an incompatible change in PredicateLeafImpl from HIVE-15269 Key: HIVE-16549 URL: https://issues.apache.org/jira/browse/HIVE-16549 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley HIVE-15269 added a parameter to the constructor for PredicateLeafImpl for a configuration object. The configuration object is only used for the new LiteralDelegates. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15929) Fix HiveDecimalWritable
Owen O'Malley created HIVE-15929: Summary: Fix HiveDecimalWritable Key: HIVE-15929 URL: https://issues.apache.org/jira/browse/HIVE-15929 Project: Hive Issue Type: Bug Reporter: Owen O'Malley HIVE-15335 broke compatibility with Hive 2.1 by making HiveDecimalWritable.getInternalStorage() throw an exception when called on an unset value. It is easy to instead return an empty array, which will allow the old code to allocate a new array. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
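The compatibility pattern described here can be sketched as follows. `LazyStorage` is a hypothetical stand-in, not HiveDecimalWritable itself; it only shows returning an empty array for an unset value so that older callers can check the length and allocate their own array.

```java
// Hypothetical stand-in for a writable with lazily set storage; not HiveDecimalWritable.
final class LazyStorage {
  private static final byte[] EMPTY = new byte[0];
  private byte[] internalStorage;   // null until a value is set

  void set(byte[] bytes) {
    internalStorage = bytes.clone();
  }

  // Returning an empty array for an unset value lets older callers check the
  // length and allocate a new array instead of having to handle an exception.
  byte[] getInternalStorage() {
    return internalStorage == null ? EMPTY : internalStorage;
  }

  public static void main(String[] args) {
    LazyStorage unset = new LazyStorage();
    byte[] scratch = unset.getInternalStorage();
    if (scratch.length < 4) {
      scratch = new byte[4];        // what an old caller would do
    }
    System.out.println(scratch.length);
  }
}
```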
[jira] [Created] (HIVE-15922) SchemaEvolution must guarantee that getFileIncluded is not null
Owen O'Malley created HIVE-15922: Summary: SchemaEvolution must guarantee that getFileIncluded is not null Key: HIVE-15922 URL: https://issues.apache.org/jira/browse/HIVE-15922 Project: Hive Issue Type: Bug Components: ORC Affects Versions: 2.1.1 Reporter: Owen O'Malley Fix For: 2.1.2 This only impacts branch-2.1, because it is already fixed in master by HIVE-14007. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15841) Upgrade Hive to ORC 1.3.2
Owen O'Malley created HIVE-15841: Summary: Upgrade Hive to ORC 1.3.2 Key: HIVE-15841 URL: https://issues.apache.org/jira/browse/HIVE-15841 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Hive needs ORC-141 and ORC-135, so we should upgrade to ORC 1.3.2 once it is released. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15643) remove use of default charset in FastHiveDecimal
Owen O'Malley created HIVE-15643: Summary: remove use of default charset in FastHiveDecimal Key: HIVE-15643 URL: https://issues.apache.org/jira/browse/HIVE-15643 Project: Hive Issue Type: Bug Reporter: Owen O'Malley HIVE-15335 introduced some new uses of String.getBytes(), which uses the default char set. These need to be replaced with the version that always uses UTF8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
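For reference, the difference between the two String.getBytes overloads looks like this. `CharsetDemo` is a hypothetical wrapper class; the JDK calls themselves are standard.

```java
import java.nio.charset.StandardCharsets;

final class CharsetDemo {
  // Depends on the JVM's default charset, so the result can vary between hosts.
  static byte[] platformBytes(String s) {
    return s.getBytes();
  }

  // Always UTF-8, regardless of platform settings.
  static byte[] utf8Bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String digits = "12345.678";
    System.out.println(utf8Bytes(digits).length);
    System.out.println(platformBytes(digits).length);
  }
}
```

For plain ASCII digits most default charsets happen to agree with UTF-8, which is why this kind of bug tends to go unnoticed until the code runs on a host with an unusual default.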
[jira] [Created] (HIVE-15419) Separate out storage-api to be released independently
Owen O'Malley created HIVE-15419: Summary: Separate out storage-api to be released independently Key: HIVE-15419 URL: https://issues.apache.org/jira/browse/HIVE-15419 Project: Hive Issue Type: Task Components: storage-api Reporter: Owen O'Malley Currently, the Hive project releases a single monolithic release, but this makes file formats reading directly into Hive's vector row batches a circular dependence. Storage-api is a small module with the vectorized row batches and SearchArgument that are necessary for efficient vectorized read and write. By releasing storage-api independently, we can make an interface that the file formats can read and write from. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15375) Port ORC-115 to storage-api
Owen O'Malley created HIVE-15375: Summary: Port ORC-115 to storage-api Key: HIVE-15375 URL: https://issues.apache.org/jira/browse/HIVE-15375 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, VectorizedRowBatch.toString() assumes that all BytesColumnVectors use the internal buffer for all of the values. This leads to incorrect strings in many common cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
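A sketch of the per-row lookup that avoids the assumption. The helper `BytesRefToString` is hypothetical and works on plain arrays that mirror the (buffer, start, length) triple a bytes column keeps for each row; it is not the ORC-115 patch.

```java
import java.nio.charset.StandardCharsets;

final class BytesRefToString {
  // Each row is described by (buffer, start, length). Rows may point at
  // different byte arrays, so a single shared buffer must not be assumed.
  static String rowToString(byte[][] vector, int[] start, int[] length, int row) {
    if (vector[row] == null) {
      return "null";
    }
    return new String(vector[row], start[row], length[row], StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] shared = "helloworld".getBytes(StandardCharsets.UTF_8);
    byte[] external = "xxORCxx".getBytes(StandardCharsets.UTF_8);
    byte[][] vector = {shared, shared, external};   // third row references another array
    int[] start = {0, 5, 2};
    int[] length = {5, 5, 3};
    for (int r = 0; r < vector.length; r++) {
      System.out.println(rowToString(vector, start, length, r));
    }
  }
}
```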
[jira] [Created] (HIVE-15124) Fix OrcInputFormat to use reader's schema for include boolean array
Owen O'Malley created HIVE-15124: Summary: Fix OrcInputFormat to use reader's schema for include boolean array Key: HIVE-15124 URL: https://issues.apache.org/jira/browse/HIVE-15124 Project: Hive Issue Type: Bug Components: ORC Affects Versions: 2.1.0 Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, the OrcInputFormat uses the file's schema rather than the reader's schema. This means that SchemaEvolution fails with an ArrayIndexOutOfBoundsException if a partition has a different schema than the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
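The failure mode can be reproduced in miniature. `IncludeArrayDemo` is a hypothetical helper, and the column counts are made-up numbers for a partition whose file has fewer columns than the table; it is not the OrcInputFormat code.

```java
final class IncludeArrayDemo {
  // Build the include array with one slot per column id of the given schema.
  static boolean[] buildInclude(int schemaColumnCount, int... wantedColumnIds) {
    boolean[] include = new boolean[schemaColumnCount];
    for (int id : wantedColumnIds) {
      include[id] = true;   // throws ArrayIndexOutOfBoundsException if id >= schemaColumnCount
    }
    return include;
  }

  public static void main(String[] args) {
    int fileColumns = 3;     // assumed: an older partition written with fewer columns
    int readerColumns = 5;   // assumed: the current table schema
    boolean[] ok = buildInclude(readerColumns, 0, 4);
    System.out.println("reader-schema sizing works, length " + ok.length);
    try {
      buildInclude(fileColumns, 0, 4);
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("file-schema sizing fails: " + e.getMessage());
    }
  }
}
```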
[jira] [Created] (HIVE-14309) Fix naming of classes in orc module to not conflict with standalone orc
Owen O'Malley created HIVE-14309: Summary: Fix naming of classes in orc module to not conflict with standalone orc Key: HIVE-14309 URL: https://issues.apache.org/jira/browse/HIVE-14309 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley The current Hive 2.0 and 2.1 releases have classes in the org.apache.orc namespace that clash with the ORC project's classes. From Hive 2.2 onward, the classes will only be in ORC, but we'll reduce classpath problems if we rename the classes to org.apache.hive.orc. I've looked at a set of projects (pig, spark, oozie, flume, & storm) and can't find any uses of Hive's versions of the org.apache.orc classes, so I believe this is a safe change that will reduce integration problems downstream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14242) Backport ORC-53 to Hive
Owen O'Malley created HIVE-14242: Summary: Backport ORC-53 to Hive Key: HIVE-14242 URL: https://issues.apache.org/jira/browse/HIVE-14242 Project: Hive Issue Type: Bug Components: ORC Reporter: Owen O'Malley Assignee: Owen O'Malley ORC-53 was mostly about the mapreduce shims for ORC, but it fixed a problem in TypeDescription that should be backported to Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14220) Protect users from Reader.rows(Options) modifying the Options object
Owen O'Malley created HIVE-14220: Summary: Protect users from Reader.rows(Options) modifying the Options object Key: HIVE-14220 URL: https://issues.apache.org/jira/browse/HIVE-14220 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley This is a matching fix to HIVE-14004, where ACID was getting into trouble because it was reusing the Reader.Options argument between files and Reader.rows was modifying it. HIVE-14004 just fixed the Hive case, but we need a corresponding fix over here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
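One common way to protect callers is a defensive copy, sketched below. `OptionsCopyDemo`, its nested `Options` class, and the `rows` method are hypothetical stand-ins for Reader.Options and Reader.rows; only the copying pattern is the point, and the actual fix may differ.

```java
// Hypothetical stand-ins for Reader.Options and Reader.rows.
final class OptionsCopyDemo {
  static final class Options implements Cloneable {
    long offset = 0;
    long length = Long.MAX_VALUE;

    Options range(long offset, long length) {
      this.offset = offset;
      this.length = length;
      return this;
    }

    @Override
    public Options clone() {
      try {
        return (Options) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);   // cannot happen: Options is Cloneable
      }
    }
  }

  // Work on a private copy so the caller can reuse the same Options for the next file.
  static long rows(Options options, long fileLength) {
    Options local = options.clone();
    local.length = Math.min(local.length, fileLength - local.offset);
    return local.length;
  }

  public static void main(String[] args) {
    Options shared = new Options();
    System.out.println(rows(shared, 100));
    System.out.println(rows(shared, 5000));   // unaffected by the first call
  }
}
```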
[jira] [Created] (HIVE-14166) Minor updates to the website.
Owen O'Malley created HIVE-14166: Summary: Minor updates to the website. Key: HIVE-14166 URL: https://issues.apache.org/jira/browse/HIVE-14166 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Minor updates to the website & documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14007) Replace ORC module with ORC release
Owen O'Malley created HIVE-14007: Summary: Replace ORC module with ORC release Key: HIVE-14007 URL: https://issues.apache.org/jira/browse/HIVE-14007 Project: Hive Issue Type: Bug Components: ORC Affects Versions: 2.2.0 Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.2.0 This completes moving the core ORC reader & writer to the ORC project. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13906) Remove guava dependence from storage-api module
Owen O'Malley created HIVE-13906: Summary: Remove guava dependence from storage-api module Key: HIVE-13906 URL: https://issues.apache.org/jira/browse/HIVE-13906 Project: Hive Issue Type: Bug Components: storage-api Reporter: Owen O'Malley Assignee: Owen O'Malley Guava is a very problematic library to depend on because of the version incompatibilities and the use of it in the storage-api module causes it to leak into everything that depends on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13763) Update smart-apply-patch.sh with ability to use patches from git
Owen O'Malley created HIVE-13763: Summary: Update smart-apply-patch.sh with ability to use patches from git Key: HIVE-13763 URL: https://issues.apache.org/jira/browse/HIVE-13763 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, the smart-apply-patch.sh doesn't understand git patches. It is relatively easy to make it understand patches generated by: {code} % git format-patch apache/master --stdout > HIVE-999.patch {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13464) Backport changes to storage-api into branch 2 for release into 2.0.1
Owen O'Malley created HIVE-13464: Summary: Backport changes to storage-api into branch 2 for release into 2.0.1 Key: HIVE-13464 URL: https://issues.apache.org/jira/browse/HIVE-13464 Project: Hive Issue Type: Bug Components: storage-api Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.0.1 To release ORC as a separate project, backporting the safe changes for storage-api to 2.0.1 will minimize the disruption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13232) Aggressively drop compression buffers in ORC OutStreams
Owen O'Malley created HIVE-13232: Summary: Aggressively drop compression buffers in ORC OutStreams Key: HIVE-13232 URL: https://issues.apache.org/jira/browse/HIVE-13232 Project: Hive Issue Type: Bug Components: ORC Reporter: Owen O'Malley Assignee: Owen O'Malley In Hive 0.11, when ORC's OutStreams were flushed they dropped all of their buffers. In the patch for HIVE-4342, we inadvertently changed that behavior so that one of the buffers is held on to. For queries with a lot of writers, and thus under significant memory pressure, this can have a significant impact on memory usage. Note that "hive.optimize.sort.dynamic.partition" avoids this problem by sorting on the dynamic partition key so that only a single ORC writer is open at once. This will use memory more effectively and avoid creating ORC files with very small stripes, which will produce better downstream performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12838) Add methods for getting and storing serialized ORC file tails
Owen O'Malley created HIVE-12838: Summary: Add methods for getting and storing serialized ORC file tails Key: HIVE-12838 URL: https://issues.apache.org/jira/browse/HIVE-12838 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Provide a pair of routines for getting and restoring from a serialized file footer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12638) Hive should not create empty files in partitions
Owen O'Malley created HIVE-12638: Summary: Hive should not create empty files in partitions Key: HIVE-12638 URL: https://issues.apache.org/jira/browse/HIVE-12638 Project: Hive Issue Type: Bug Components: File Formats Reporter: Owen O'Malley Currently Hive creates empty files for buckets with no rows in a directory. I believe this was originally because the SMB and bucket join require files to be present to get InputSplits. There are customers where this behavior leads to the creation of more than 200,000 empty ORC files per hour on a cluster (with peaks of more than 725,000 per hour). We've also seen instances where a single DataNode is involved in 5600 of these empty ORC files within a 2 minute period. This causes significant stress on HDFS at both the NameNode and DataNode and is completely unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12571) Push TypeDescription in to the ReaderImpl and RecordReaderImpl
Owen O'Malley created HIVE-12571: Summary: Push TypeDescription in to the ReaderImpl and RecordReaderImpl Key: HIVE-12571 URL: https://issues.apache.org/jira/browse/HIVE-12571 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley We want to use the TypeDescription rather than List because it gives us a much better interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12286) Add option to ORC vectorized reader to not trim spaces from char columns.
Owen O'Malley created HIVE-12286: Summary: Add option to ORC vectorized reader to not trim spaces from char columns. Key: HIVE-12286 URL: https://issues.apache.org/jira/browse/HIVE-12286 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Currently the ORC reader in nextBatch always strips spaces from char columns. It is more natural for non-Hive applications to make it not trim the results on read, so I propose adding a switch to ReaderOptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12159) Create vectorized readers for the complex types
Owen O'Malley created HIVE-12159: Summary: Create vectorized readers for the complex types Key: HIVE-12159 URL: https://issues.apache.org/jira/browse/HIVE-12159 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley We need vectorized readers for the complex types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12066) Add javadoc for methods added to public APIs
Owen O'Malley created HIVE-12066: Summary: Add javadoc for methods added to public APIs Key: HIVE-12066 URL: https://issues.apache.org/jira/browse/HIVE-12066 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Sergey Shelukhin Looking through the changes for ORC, there are methods being added without documentation:
{code}
--- ql/src/java/org/apache/hadoop/hive/ql/io/orc/Reader.java
+++ ql/src/java/org/apache/hadoop/hive/ql/io/orc/Reader.java
@@ -360,8 +353,18 @@ RecordReader rows(long offset, long length,
   MetadataReader metadata() throws IOException;
+  List getVersionList();
+
+  int getMetadataSize();
+
+  List getOrcProtoStripeStatistics();
+
+  List getStripeStatistics();
+
+  List getOrcProtoFileStatistics();
+
+  DataReader createDefaultDataReader(boolean useZeroCopy);
+
{code}
You really need to look through all of the interfaces and fix them before merging into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12055) Create row-by-row shims for the write path
Owen O'Malley created HIVE-12055: Summary: Create row-by-row shims for the write path Key: HIVE-12055 URL: https://issues.apache.org/jira/browse/HIVE-12055 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley As part of removing the row-by-row writer, we'll need to shim out the higher level API (OrcSerde and OrcOutputFormat) so that we maintain backwards compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12054) Create vectorized write method
Owen O'Malley created HIVE-12054: Summary: Create vectorized write method Key: HIVE-12054 URL: https://issues.apache.org/jira/browse/HIVE-12054 Project: Hive Issue Type: Sub-task Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley We need to add writer methods that can write VectorizedRowBatch to an ORC file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11890) Create ORC module
Owen O'Malley created HIVE-11890: Summary: Create ORC module Key: HIVE-11890 URL: https://issues.apache.org/jira/browse/HIVE-11890 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Start moving classes over to the ORC module. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11807) Set ORC buffer size in relation to set stripe size
Owen O'Malley created HIVE-11807: Summary: Set ORC buffer size in relation to set stripe size Key: HIVE-11807 URL: https://issues.apache.org/jira/browse/HIVE-11807 Project: Hive Issue Type: Improvement Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley A customer produced ORC files with very small stripe sizes (10k rows/stripe) by setting a small 64MB stripe size and 256K buffer size for a 54 column table. At that size, each of the streams only gets a buffer or two before the stripe size is reached. The current code uses the available memory instead of the stripe size and thus doesn't shrink the buffer size if the JVM has much more memory than the stripe size. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
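The arithmetic behind "a buffer or two" can be checked with a few lines. `StripeBufferMath` is a hypothetical helper; the figure of three streams per column and the halving heuristic are assumptions for illustration, not the ORC writer's actual logic.

```java
final class StripeBufferMath {
  // Rough number of compression buffers each stream can fill before the stripe is full.
  // streamsPerColumn is an assumed average; the report does not state it.
  static double buffersPerStream(long stripeSize, int columns, int streamsPerColumn, int bufferSize) {
    return (double) stripeSize / ((long) columns * streamsPerColumn * bufferSize);
  }

  // Hypothetical heuristic: halve the buffer until each stream can fill at
  // least minBuffers buffers, never going below floor bytes.
  static int fitBufferSize(long stripeSize, int columns, int streamsPerColumn,
                           int requested, int minBuffers, int floor) {
    int size = requested;
    while (size > floor
        && buffersPerStream(stripeSize, columns, streamsPerColumn, size) < minBuffers) {
      size /= 2;
    }
    return Math.max(size, floor);
  }

  public static void main(String[] args) {
    long stripe = 64L * 1024 * 1024;   // 64MB stripe from the report
    int buffer = 256 * 1024;           // 256K buffer from the report
    System.out.println(buffersPerStream(stripe, 54, 3, buffer));
    System.out.println(fitBufferSize(stripe, 54, 3, buffer, 10, 4 * 1024));
  }
}
```

The key design point is that the inputs are the stripe size and the column count, not the JVM's available memory.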
[jira] [Created] (HIVE-11808) In ORC removing the dynamic dispatch for StringTreeReader improves read by 10%
Owen O'Malley created HIVE-11808: Summary: In ORC removing the dynamic dispatch for StringTreeReader improves read by 10% Key: HIVE-11808 URL: https://issues.apache.org/jira/browse/HIVE-11808 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley When we introduced the dictionary/direct encodings for ORC, we made subclasses of StringTreeReader named StringDirectTreeReader and StringDictionaryTreeReader and introduced an additional dynamic dispatch in the inner loop. For tables with a lot of string columns, removing that extra dispatch improves performance by 10%. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11704) Create errata.txt file
Owen O'Malley created HIVE-11704: Summary: Create errata.txt file Key: HIVE-11704 URL: https://issues.apache.org/jira/browse/HIVE-11704 Project: Hive Issue Type: Bug Components: Documentation Reporter: Owen O'Malley Assignee: Owen O'Malley As discussed on the email list, we should have a file documenting known problems in the commit messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11618) Correct the SARG api to reunify the PredicateLeaf.Type INTEGER and LONG
Owen O'Malley created HIVE-11618: Summary: Correct the SARG api to reunify the PredicateLeaf.Type INTEGER and LONG Key: HIVE-11618 URL: https://issues.apache.org/jira/browse/HIVE-11618 Project: Hive Issue Type: Bug Components: Types Reporter: Owen O'Malley The Parquet binding leaked implementation details into the generic SARG api. Rather than make all users of the SARG api deal with each of the specific types, reunify the INTEGER and LONG types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11417) Create ObjectInspectors for VectorizedRowBatch
Owen O'Malley created HIVE-11417: Summary: Create ObjectInspectors for VectorizedRowBatch Key: HIVE-11417 URL: https://issues.apache.org/jira/browse/HIVE-11417 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley I'd like to make the default path for reading and writing ORC files to be vectorized. To ensure that Hive can still read row by row, I'll make ObjectInspectors that are backed by the VectorizedRowBatch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11370) Extend SARGs to support binary type
Owen O'Malley created HIVE-11370: Summary: Extend SARGs to support binary type Key: HIVE-11370 URL: https://issues.apache.org/jira/browse/HIVE-11370 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Currently the sargs only apply to string, boolean, integer, decimal, floating, date, and timestamp columns. It would be good to support binary blobs also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11321) Move OrcFile.OrcTableProperties from OrcFile into OrcConf.
Owen O'Malley created HIVE-11321: Summary: Move OrcFile.OrcTableProperties from OrcFile into OrcConf. Key: HIVE-11321 URL: https://issues.apache.org/jira/browse/HIVE-11321 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley We should pull all of the configuration/table property knobs into a single list. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11307) Remove getWritableObject from ColumnVectorBatch
Owen O'Malley created HIVE-11307: Summary: Remove getWritableObject from ColumnVectorBatch Key: HIVE-11307 URL: https://issues.apache.org/jira/browse/HIVE-11307 Project: Hive Issue Type: Sub-task Components: Vectorization Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.0.0 ColumnVectorBatch.getWritableObject is only used in a few tests and is really problematic when adding the complex types to vectorization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11253) Move SearchArgument and VectorizedRowBatch classes to storage-api.
Owen O'Malley created HIVE-11253: Summary: Move SearchArgument and VectorizedRowBatch classes to storage-api. Key: HIVE-11253 URL: https://issues.apache.org/jira/browse/HIVE-11253 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11245) Fix the LLAP to ORC APIs
Owen O'Malley created HIVE-11245: Summary: Fix the LLAP to ORC APIs Key: HIVE-11245 URL: https://issues.apache.org/jira/browse/HIVE-11245 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Priority: Blocker Fix For: llap Currently the LLAP branch has refactored the ORC code to have different code paths depending on whether the data is coming from the cache or a FileSystem. We need to introduce a concept of a DataSource that is responsible for getting the necessary bytes regardless of whether they are coming from a FileSystem, in memory cache, or both. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
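A rough sketch of what such a DataSource abstraction could look like. Every name below is hypothetical, since the issue only names the concept: `ArraySource` stands in for a FileSystem-backed source, and `CachingSource` serves repeated ranges from memory and falls back to the wrapped source on a miss.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface; the issue names the concept but not its methods.
interface DataSourceSketch {
  ByteBuffer read(long offset, int length);
}

// Stand-in for a FileSystem-backed source, reading from an in-memory "file".
final class ArraySource implements DataSourceSketch {
  private final byte[] file;
  int reads = 0;   // counts how often the underlying storage is touched

  ArraySource(byte[] file) {
    this.file = file;
  }

  @Override
  public ByteBuffer read(long offset, int length) {
    reads++;
    return ByteBuffer.wrap(file, (int) offset, length).slice();
  }
}

// Serves repeated ranges from memory and falls back to the wrapped source on a miss.
final class CachingSource implements DataSourceSketch {
  private final DataSourceSketch fallback;
  private final Map<String, ByteBuffer> cache = new HashMap<>();

  CachingSource(DataSourceSketch fallback) {
    this.fallback = fallback;
  }

  @Override
  public ByteBuffer read(long offset, int length) {
    ByteBuffer hit = cache.computeIfAbsent(offset + ":" + length,
        key -> fallback.read(offset, length));
    return hit.duplicate();   // independent position and limit for each caller
  }
}
```

The reader code would then depend only on the interface, so one code path covers the cache, the FileSystem, or a mix of both.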
[jira] [Created] (HIVE-11209) Clean up dependencies in HiveDecimalWritable
Owen O'Malley created HIVE-11209: Summary: Clean up dependencies in HiveDecimalWritable Key: HIVE-11209 URL: https://issues.apache.org/jira/browse/HIVE-11209 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently HiveDecimalWritable depends on: * org.apache.hadoop.hive.serde2.ByteStream * org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryUtils * org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils since we need HiveDecimalWritable for the decimal VectorizedColumnBatch, breaking these dependencies will improve things. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11210) Remove dependency on HiveConf from Orc reader writer
Owen O'Malley created HIVE-11210: Summary: Remove dependency on HiveConf from Orc reader writer Key: HIVE-11210 URL: https://issues.apache.org/jira/browse/HIVE-11210 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently the ORC reader and writer get their default values from HiveConf. I propose that we make the reader and writer have their own programmatic defaults and the OrcInputFormat and OrcOutputFormat can use the version in HiveConf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11212) Create vectorized types for complex types
Owen O'Malley created HIVE-11212: Summary: Create vectorized types for complex types Key: HIVE-11212 URL: https://issues.apache.org/jira/browse/HIVE-11212 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley We need vectorized types for structs, maps, lists, and unions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11144) Replace row by row reader and writer with shims to vectorized path.
Owen O'Malley created HIVE-11144: Summary: Replace row by row reader and writer with shims to vectorized path. Key: HIVE-11144 URL: https://issues.apache.org/jira/browse/HIVE-11144 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley The core ORC reader and writer will be better served if the vectorized read and write paths are the primary API and the row by row reader and writer and their corresponding object inspectors become Hive-specific shims. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11137) In DateWritable remove the use of LazyBinaryUtils
Owen O'Malley created HIVE-11137: Summary: In DateWritable remove the use of LazyBinaryUtils Key: HIVE-11137 URL: https://issues.apache.org/jira/browse/HIVE-11137 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently the DateWritable class uses LazyBinaryUtils, which has a lot of dependencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11124) Move OrcRecordUpdater.getAcidEventFields to RecordReaderFactory
Owen O'Malley created HIVE-11124: Summary: Move OrcRecordUpdater.getAcidEventFields to RecordReaderFactory Key: HIVE-11124 URL: https://issues.apache.org/jira/browse/HIVE-11124 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Move OrcRecordUpdater.getAcidEventFields to RecordReaderFactory to avoid the extra dependence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11115) Remove dependence from ORC's WriterImpl to OrcInputFormat
Owen O'Malley created HIVE-11115: Summary: Remove dependence from ORC's WriterImpl to OrcInputFormat Key: HIVE-11115 URL: https://issues.apache.org/jira/browse/HIVE-11115 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently there is a link from WriterImpl to OrcInputFormat that should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11086) Remove use of ErrorMsg in Orc's RunLengthIntegerReaderV2
Owen O'Malley created HIVE-11086: Summary: Remove use of ErrorMsg in Orc's RunLengthIntegerReaderV2 Key: HIVE-11086 URL: https://issues.apache.org/jira/browse/HIVE-11086 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley ORC's rle v2 reader uses a string literal from ErrorMsg, which forces a large dependency on the rle v2 reader. Pulling the string literal in directly doesn't change the behavior and fixes the linkage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11080) Modify VectorizedRowBatch.toString() to not depend on VectorExpressionWriter
Owen O'Malley created HIVE-11080: Summary: Modify VectorizedRowBatch.toString() to not depend on VectorExpressionWriter Key: HIVE-11080 URL: https://issues.apache.org/jira/browse/HIVE-11080 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently the VectorizedRowBatch.toString method uses the VectorExpressionWriter to convert the row batch to a string. Since the string is only used for printing error messages, I'd propose making the toString use the types of the vector batch instead of the object inspector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10798) Remove dependence on VectorizedBatchUtil from VectorizedOrcAcidRowReader
Owen O'Malley created HIVE-10798: Summary: Remove dependence on VectorizedBatchUtil from VectorizedOrcAcidRowReader Key: HIVE-10798 URL: https://issues.apache.org/jira/browse/HIVE-10798 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley VectorizedBatchUtil has a lot of dependences that Orc should avoid and the code should be refactored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10796) Remove dependencies on NumericHistogram and NumDistinctValueEstimator from JavaDataModel
Owen O'Malley created HIVE-10796: Summary: Remove dependencies on NumericHistogram and NumDistinctValueEstimator from JavaDataModel Key: HIVE-10796 URL: https://issues.apache.org/jira/browse/HIVE-10796 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley The JavaDataModel class is used in a lot of places and the non-general calculations are better done in the other classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10797) Simplify the test for vectorized input
Owen O'Malley created HIVE-10797: Summary: Simplify the test for vectorized input Key: HIVE-10797 URL: https://issues.apache.org/jira/browse/HIVE-10797 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley The call to Utilities.isVectorMode should be simplified for the readers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10795) Remove use of PerfLogger from Orc
Owen O'Malley created HIVE-10795: Summary: Remove use of PerfLogger from Orc Key: HIVE-10795 URL: https://issues.apache.org/jira/browse/HIVE-10795 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley PerfLogger is yet another class with a huge dependency set that Orc doesn't need. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10799) Refactor the SearchArgumentFactory to remove the dependence on ExprNodeGenericFuncDesc
Owen O'Malley created HIVE-10799: Summary: Refactor the SearchArgumentFactory to remove the dependence on ExprNodeGenericFuncDesc Key: HIVE-10799 URL: https://issues.apache.org/jira/browse/HIVE-10799 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley SearchArgumentFactory and SearchArgumentImpl are high level and shouldn't depend on the internals of Hive's AST model. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10794) Remove the dependence from ErrorMsg to HiveUtils
Owen O'Malley created HIVE-10794: Summary: Remove the dependence from ErrorMsg to HiveUtils Key: HIVE-10794 URL: https://issues.apache.org/jira/browse/HIVE-10794 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley HiveUtils has a large set of dependencies and ErrorMsg only needs the new line constant. Breaking the dependence will reduce the dependency set from ErrorMsg significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10407) separate out the timestamp ranges for testing purposes
Owen O'Malley created HIVE-10407: Summary: separate out the timestamp ranges for testing purposes Key: HIVE-10407 URL: https://issues.apache.org/jira/browse/HIVE-10407 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Some platforms have limits for date ranges, so separate out the test cases that are outside of the range 1970 to 2038. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
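The issue does not say why some platforms are limited to 1970 through 2038. One common cause, assumed here, is a signed 32-bit count of seconds since the epoch. The sketch below shows where that upper bound falls and how test timestamps could be split into an in-range and an out-of-range group. The class and method names are hypothetical and are not taken from the Hive tests.

```java
import java.time.Instant;

final class EpochRange {
    /** Last second representable as a signed 32-bit seconds-since-epoch value. */
    static final Instant MAX_32_BIT = Instant.ofEpochSecond(Integer.MAX_VALUE);

    /** True when the timestamp lies between 1970-01-01T00:00:00Z and the 32-bit limit. */
    static boolean inPortableRange(Instant t) {
        long seconds = t.getEpochSecond();
        return seconds >= 0 && seconds <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("32-bit limit: " + MAX_32_BIT);
        System.out.println(inPortableRange(Instant.parse("2015-04-20T00:00:00Z")));
        System.out.println(inPortableRange(Instant.parse("1969-12-31T23:59:59Z")));
        System.out.println(inPortableRange(Instant.parse("2038-01-19T03:14:08Z")));
    }
}
```

Test cases for which `inPortableRange` returns false would go into the separate group that restricted platforms skip.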
[jira] [Created] (HIVE-10305) TestOrcFile has a mistake that makes metadata test ineffective
Owen O'Malley created HIVE-10305: Summary: TestOrcFile has a mistake that makes metadata test ineffective Key: HIVE-10305 URL: https://issues.apache.org/jira/browse/HIVE-10305 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Two of the values that are being stored as user metadata in TestOrcFile.metaData weren't flipped and thus were empty buffers. The test passes because they are compared to empty buffers. We should fix the test to perform the expected test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
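A small illustration of the failure mode described above, using only java.nio.ByteBuffer. It is not the TestOrcFile code. The helper name `filled` is made up, and the buffer is assumed to be allocated at exactly the payload size, which is the case where a missing flip() leaves nothing readable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class FlipDemo {
    /** Writes the text into a buffer of exactly matching size, optionally flipping it for reading. */
    static ByteBuffer filled(String text, boolean flip) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(bytes.length);
        buf.put(bytes);          // position now equals limit, so nothing is readable
        if (flip) {
            buf.flip();          // limit = position, position = 0: the bytes become readable
        }
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer unflippedA = filled("hello", false);
        ByteBuffer unflippedB = filled("world", false);
        System.out.println("unflipped remaining: " + unflippedA.remaining());
        // Both buffers have no remaining bytes, so they compare equal despite different contents.
        System.out.println("unflipped buffers equal: " + unflippedA.equals(unflippedB));
        System.out.println("flipped remaining: " + filled("hello", true).remaining());
    }
}
```

Because ByteBuffer.equals only compares the remaining bytes, two unflipped buffers with different contents still compare equal, which is how a test of this shape can pass without checking anything.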
[jira] [Created] (HIVE-10171) Create a storage-api module
Owen O'Malley created HIVE-10171: Summary: Create a storage-api module Key: HIVE-10171 URL: https://issues.apache.org/jira/browse/HIVE-10171 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley To support high performance file formats, I'd like to propose that we move the minimal set of classes that are required to integrate with Hive in to a new module named storage-api. This module will include VectorizedRowBatch, the various ColumnVector classes, and the SARG classes. It will form the start of an API that high performance storage formats can use to integrate with Hive. Both ORC and Parquet can use the new API to support vectorization and SARGs without performance destroying shims. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9593) ORC Reader should ignore unknown metadata streams
[ https://issues.apache.org/jira/browse/HIVE-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9593: Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Status: Resolved (was: Patch Available) I committed this. Thanks for the review, Gopal! ORC Reader should ignore unknown metadata streams -- Key: HIVE-9593 URL: https://issues.apache.org/jira/browse/HIVE-9593 Project: Hive Issue Type: Bug Components: File Formats Affects Versions: 0.11.0, 0.12.0, 0.13.1, 1.0.0, 1.2.0, 1.1.0 Reporter: Gopal V Assignee: Owen O'Malley Fix For: 1.0.1, 1.1.0 Attachments: HIVE-9593.no-autogen.patch, hive-9593.patch ORC readers should ignore metadata streams which are non-essential additions to the main data streams. This will include additional indices, histograms or anything we add as an optional stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9593) ORC Reader should ignore unknown metadata streams
[ https://issues.apache.org/jira/browse/HIVE-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9593: Status: Patch Available (was: Open) ORC Reader should ignore unknown metadata streams -- Key: HIVE-9593 URL: https://issues.apache.org/jira/browse/HIVE-9593 Project: Hive Issue Type: Bug Components: File Formats Affects Versions: 0.13.1, 0.12.0, 0.11.0, 1.0.0, 1.2.0, 1.1.0 Reporter: Gopal V Assignee: Owen O'Malley Attachments: hive-9593.patch ORC readers should ignore metadata streams which are non-essential additions to the main data streams. This will include additional indices, histograms or anything we add as an optional stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9593) ORC Reader should ignore unknown metadata streams
[ https://issues.apache.org/jira/browse/HIVE-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9593: Attachment: hive-9593.patch This patch changes all of the required fields to be optional. I've gone through the current code to ensure that null pointers from getKind() won't cause NPE. ORC Reader should ignore unknown metadata streams -- Key: HIVE-9593 URL: https://issues.apache.org/jira/browse/HIVE-9593 Project: Hive Issue Type: Bug Components: File Formats Affects Versions: 0.11.0, 0.12.0, 0.13.1, 1.0.0, 1.2.0, 1.1.0 Reporter: Gopal V Assignee: Owen O'Malley Attachments: hive-9593.patch ORC readers should ignore metadata streams which are non-essential additions to the main data streams. This will include additional indices, histograms or anything we add as an optional stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302507#comment-14302507 ] Owen O'Malley commented on HIVE-9188: - Suggestions: * Pick m to always be a multiple of 64 (since you are using longs as the representation) * change the representation of BloomFilter in orc_proto to record the number of hash functions and not the size or fpp. * use fixed64 for the bit field * you'll also need to update the specification in the wiki with the change to the format (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification) * revert the spurious change to CliDriver.java * revert the spurious change to .gitignore * it seems suboptimal to convert long values to bytes before hashing BloomFilter in ORC row group index -- Key: HIVE-9188 URL: https://issues.apache.org/jira/browse/HIVE-9188 Project: Hive Issue Type: New Feature Components: File Formats Affects Versions: 0.15.0 Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Labels: orcfile Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch, HIVE-9188.5.patch, HIVE-9188.6.patch BloomFilters are well known probabilistic data structure for set membership checking. We can use bloom filters in ORC index for better row group pruning. Currently, ORC row group index uses min/max statistics to eliminate row groups (stripes as well) that do not satisfy predicate condition specified in the query. But in some cases, the efficiency of min/max based elimination is not optimal (unsorted columns with wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with IN clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
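The first two suggestions can be sketched as follows: round the bit count m up to a multiple of 64 so it maps onto whole longs, and derive the number of hash functions, which is then the only parameter a reader needs besides the bit set itself. This is an illustrative sketch, not the patch. The class and method names are hypothetical, and k = round((m / n) * ln 2) is the usual textbook choice rather than anything stated in the comment.

```java
final class BloomLayout {
    /** Rounds a bit count up to the next multiple of 64 so it fills whole longs. */
    static long roundUpTo64(long bits) {
        return (bits + 63) / 64 * 64;
    }

    /** Textbook number of hash functions for m bits and n expected entries (at least 1). */
    static int numHashFunctions(long expectedEntries, long bits) {
        return Math.max(1, (int) Math.round((double) bits / expectedEntries * Math.log(2.0)));
    }

    public static void main(String[] args) {
        long m = roundUpTo64(62_353L);          // hypothetical unrounded size
        long[] bitSet = new long[(int) (m / 64)];
        int k = numHashFunctions(10_000L, m);
        System.out.println("bits=" + m + " longs=" + bitSet.length + " hashes=" + k);
    }
}
```

With m fixed to a multiple of 64, the length of the long array implies the size, so storing the hash function count alongside the bits is enough to reconstruct the filter.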
[jira] [Commented] (HIVE-9451) Add max size of column dictionaries to ORC metadata
[ https://issues.apache.org/jira/browse/HIVE-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297178#comment-14297178 ] Owen O'Malley commented on HIVE-9451: - We should also record the stripe size that was used as the file was written. That gives a strict upper bound on the size of memory in the writer. Add max size of column dictionaries to ORC metadata --- Key: HIVE-9451 URL: https://issues.apache.org/jira/browse/HIVE-9451 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley To predict the amount of memory required to read an ORC file we need to know the size of the dictionaries for the columns that we are reading. I propose adding the number of bytes for each column's dictionary to the stripe's column statistics. The file's column statistics would have the maximum dictionary size for each column. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297319#comment-14297319 ] Owen O'Malley commented on HIVE-9317: - +1 to not rolling a new RC specifically for this one. I just want to make sure it goes into any new RCs. move Microsoft copyright to NOTICE file --- Key: HIVE-9317 URL: https://issues.apache.org/jira/browse/HIVE-9317 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Priority: Blocker Fix For: 0.15.0, 1.0.0 Attachments: hive-9327.txt There are a set of files that still have the Microsoft copyright notices. Those notices need to be moved into NOTICES and replaced with the standard Apache headers. {code} ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9317: Resolution: Fixed Fix Version/s: 1.0.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I committed this. Thanks for the review, Alan. move Microsoft copyright to NOTICE file --- Key: HIVE-9317 URL: https://issues.apache.org/jira/browse/HIVE-9317 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Priority: Blocker Fix For: 0.15.0, 1.0.0 Attachments: hive-9327.txt There are a set of files that still have the Microsoft copyright notices. Those notices need to be moved into NOTICES and replaced with the standard Apache headers. {code} ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9317: Attachment: hive-9327.txt This patch changes no code, just puts the required Apache header on the source files and moves Microsoft's copyright notice to the NOTICE file. move Microsoft copyright to NOTICE file --- Key: HIVE-9317 URL: https://issues.apache.org/jira/browse/HIVE-9317 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Fix For: 0.15.0 Attachments: hive-9327.txt There are a set of files that still have the Microsoft copyright notices. Those notices need to be moved into NOTICES and replaced with the standard Apache headers. {code} ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9317: Priority: Blocker (was: Major) move Microsoft copyright to NOTICE file --- Key: HIVE-9317 URL: https://issues.apache.org/jira/browse/HIVE-9317 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Priority: Blocker Fix For: 0.15.0 Attachments: hive-9327.txt There are a set of files that still have the Microsoft copyright notices. Those notices need to be moved into NOTICES and replaced with the standard Apache headers. {code} ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned HIVE-9317: --- Assignee: Owen O'Malley move Microsoft copyright to NOTICE file --- Key: HIVE-9317 URL: https://issues.apache.org/jira/browse/HIVE-9317 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 0.15.0 Attachments: hive-9327.txt There are a set of files that still have the Microsoft copyright notices. Those notices need to be moved into NOTICES and replaced with the standard Apache headers. {code} ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9317: Status: Patch Available (was: Open) move Microsoft copyright to NOTICE file --- Key: HIVE-9317 URL: https://issues.apache.org/jira/browse/HIVE-9317 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Priority: Blocker Fix For: 0.15.0 Attachments: hive-9327.txt There are a set of files that still have the Microsoft copyright notices. Those notices need to be moved into NOTICES and replaced with the standard Apache headers. {code} ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9467) ORC - sort dictionary streams to the end of the stripe
Owen O'Malley created HIVE-9467: --- Summary: ORC - sort dictionary streams to the end of the stripe Key: HIVE-9467 URL: https://issues.apache.org/jira/browse/HIVE-9467 Project: Hive Issue Type: Bug Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley When reading ORC files, it would be convenient to group the dictionary streams at the end of the stripe. This would allow the reader to use fewer read operations if they want to load the dictionaries before they load the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9451) Add max size of column dictionaries to ORC metadata
Owen O'Malley created HIVE-9451: --- Summary: Add max size of column dictionaries to ORC metadata Key: HIVE-9451 URL: https://issues.apache.org/jira/browse/HIVE-9451 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley To predict the amount of memory required to read an ORC file we need to know the size of the dictionaries for the columns that we are reading. I propose adding the number of bytes for each column's dictionary to the stripe's column statistics. The file's column statistics would have the maximum dictionary size for each column. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8966) Delta files created by hive hcatalog streaming cannot be compacted
[ https://issues.apache.org/jira/browse/HIVE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284927#comment-14284927 ] Owen O'Malley commented on HIVE-8966: - This looks good, Alan. +1 One minor nit is that the class javadoc for ValidReadTxnList has And instead of the intended An. Delta files created by hive hcatalog streaming cannot be compacted -- Key: HIVE-8966 URL: https://issues.apache.org/jira/browse/HIVE-8966 Project: Hive Issue Type: Bug Components: HCatalog Affects Versions: 0.14.0 Environment: hive Reporter: Jihong Liu Assignee: Alan Gates Priority: Critical Fix For: 0.14.1 Attachments: HIVE-8966.2.patch, HIVE-8966.3.patch, HIVE-8966.4.patch, HIVE-8966.5.patch, HIVE-8966.patch Hive HCatalog streaming also creates a file named bucket_n_flush_length in each delta directory, where n is the bucket number. compactor.CompactorMR thinks this file also needs to be compacted. Since the file cannot be compacted, compactor.CompactorMR does not continue with the compaction. In a test, after the bucket_n_flush_length file was removed, the alter table partition compact finished successfully. If the file is not deleted, nothing is compacted. This is probably a very severe bug. Both 0.13 and 0.14 have this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8966) Delta files created by hive hcatalog streaming cannot be compacted
[ https://issues.apache.org/jira/browse/HIVE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284935#comment-14284935 ] Owen O'Malley commented on HIVE-8966: - After a little more thought, I'm worried that someone will accidentally create a ValidCompactorTxnList and get confused by the different behavior. I think it would make sense to move it into the compactor package to minimize the chance that someone uses it by mistake. Delta files created by hive hcatalog streaming cannot be compacted -- Key: HIVE-8966 URL: https://issues.apache.org/jira/browse/HIVE-8966 Project: Hive Issue Type: Bug Components: HCatalog Affects Versions: 0.14.0 Environment: hive Reporter: Jihong Liu Assignee: Alan Gates Priority: Critical Fix For: 0.14.1 Attachments: HIVE-8966.2.patch, HIVE-8966.3.patch, HIVE-8966.4.patch, HIVE-8966.5.patch, HIVE-8966.patch Hive HCatalog streaming also creates a file named bucket_n_flush_length in each delta directory, where n is the bucket number. compactor.CompactorMR thinks this file also needs to be compacted. Since the file cannot be compacted, compactor.CompactorMR does not continue with the compaction. In a test, after the bucket_n_flush_length file was removed, the alter table partition compact finished successfully. If the file is not deleted, nothing is compacted. This is probably a very severe bug. Both 0.13 and 0.14 have this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275997#comment-14275997 ] Owen O'Malley commented on HIVE-9188: - [~prasanth_j] Please remove the upper two levels of bloom filters. They are utterly useless. Their false positive rate will be far above 99%. They absolutely should not be stored in the column statistics. That will hurt the common ppd case and not help. BloomFilter in ORC row group index -- Key: HIVE-9188 URL: https://issues.apache.org/jira/browse/HIVE-9188 Project: Hive Issue Type: New Feature Components: File Formats Affects Versions: 0.15.0 Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Labels: orcfile Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch BloomFilters are well known probabilistic data structure for set membership checking. We can use bloom filters in ORC index for better row group pruning. Currently, ORC row group index uses min/max statistics to eliminate row groups (stripes as well) that do not satisfy predicate condition specified in the query. But in some cases, the efficiency of min/max based elimination is not optimal (unsorted columns with wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with IN clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
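A rough check of the "far above 99%" claim, under assumptions the comment does not spell out: the stripe- and file-level filters are taken to have the same size as a row-group filter (about 62,400 bits with 4 hash functions, the textbook figures for n = 10,000 and p = 0.05), and the stripe and file entry counts below are hypothetical. The estimate uses the standard approximation p = (1 - e^(-kn/m))^k.

```java
final class OverfilledBloomFilter {
    /** Standard false-positive estimate for a filter of `bits` bits and `hashes` hash functions. */
    static double falsePositiveRate(long entries, long bits, int hashes) {
        double fractionOfBitsSet = 1.0 - Math.exp(-(double) hashes * entries / bits);
        return Math.pow(fractionOfBitsSet, hashes);
    }

    public static void main(String[] args) {
        long bits = 62_400L;   // assumed row-group-sized filter
        int hashes = 4;        // assumed hash function count
        System.out.printf("row group (10,000 entries): %.4f%n", falsePositiveRate(10_000L, bits, hashes));
        System.out.printf("stripe (100,000 entries, hypothetical): %.4f%n", falsePositiveRate(100_000L, bits, hashes));
        System.out.printf("file (1,000,000 entries, hypothetical): %.4f%n", falsePositiveRate(1_000_000L, bits, hashes));
    }
}
```

Once a filter of this size holds ten times the entries it was sized for, nearly every bit is set and almost every lookup reports a possible match, so it prunes nothing.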
[jira] [Created] (HIVE-9317) move Microsoft copyright to NOTICE file
Owen O'Malley created HIVE-9317: --- Summary: move Microsoft copyright to NOTICE file Key: HIVE-9317 URL: https://issues.apache.org/jira/browse/HIVE-9317 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Fix For: 0.15.0 There are a set of files that still have the Microsoft copyright notices. Those notices need to be moved into NOTICES and replaced with the standard Apache headers. {code} ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268573#comment-14268573 ] Owen O'Malley commented on HIVE-9188: - [~prasanth_j] Ok, I thought that you said that you were going to have bloom filters at row group, stripe, and file level. I agree completely that ORC should only have bloom filters at the row group level. Having the bloom filter as a separate stream means the reader does *far* less IO. It will still go through the code that merges adjacent ranges together into a single read. So if you need all of the indexes and bloom filters for all of the columns the reader should read them in a single IO operation. On the other hand, if it doesn't need any bloom filter it shouldn't have to load the extra mb of data it doesn't need. BloomFilter in ORC row group index -- Key: HIVE-9188 URL: https://issues.apache.org/jira/browse/HIVE-9188 Project: Hive Issue Type: New Feature Components: File Formats Affects Versions: 0.15.0 Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Labels: orcfile Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch BloomFilters are well known probabilistic data structure for set membership checking. We can use bloom filters in ORC index for better row group pruning. Currently, ORC row group index uses min/max statistics to eliminate row groups (stripes as well) that do not satisfy predicate condition specified in the query. But in some cases, the efficiency of min/max based elimination is not optimal (unsorted columns with wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with IN clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
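The range-merging behaviour mentioned above can be sketched like this. It is a simplified stand-in and not the ORC reader's code: ranges are hypothetical [start, end) byte offsets held in two-element long arrays, and ranges that touch or overlap are combined so each merged range needs one read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class RangeMerger {
    /** Merges touching or overlapping [start, end) ranges; the input is left unmodified. */
    static List<long[]> merge(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> merged = new ArrayList<>();
        for (long[] r : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last[1]) {
                last[1] = Math.max(last[1], r[1]);       // extend the current read
            } else {
                merged.add(new long[] {r[0], r[1]});     // start a new read
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical layout: row index at [0,100), bloom filter at [100,8100), next row index at [8100,8200).
        List<long[]> everything = Arrays.asList(
            new long[] {0, 100}, new long[] {100, 8100}, new long[] {8100, 8200});
        List<long[]> indexesOnly = Arrays.asList(new long[] {0, 100}, new long[] {8100, 8200});
        System.out.println("reads when loading everything: " + merge(everything).size());
        System.out.println("reads when skipping bloom filters: " + merge(indexesOnly).size());
    }
}
```

In this made-up layout, a reader that wants every stream still issues a single read, while a reader that skips the bloom filter issues two small reads and never loads the filter bytes.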
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268176#comment-14268176 ] Owen O'Malley commented on HIVE-9188: - [~gopalv] I don't understand your concern. The indexes are already stored in ROW_INDEX streams. I'm just saying that the bloom filters, which are much larger than the rest of the ROW_INDEX, should be split into a BLOOM_FILTER stream instead of being bundled in with the ROW_INDEX stream. That would let you load just the ROW_INDEX if you don't need the bloom filter. The size of the bloom filter needs to scale with the number of items. You've sized them for the default row group size (n = 10,000, p = 0.05), which comes to 7.8kb. To use them at the file level, you'd need to make the bloom filters much, much larger. For a file with 100 million values in a column, you'd need a 74mb bloom filter. I'd propose that you only do the bloom filters at the row group level and scale them to match the row index stride rather than just use the default 10k. BloomFilter in ORC row group index -- Key: HIVE-9188 URL: https://issues.apache.org/jira/browse/HIVE-9188 Project: Hive Issue Type: New Feature Components: File Formats Affects Versions: 0.15.0 Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Labels: orcfile Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch Bloom filters are a well-known probabilistic data structure for set membership checking. We can use bloom filters in the ORC index for better row group pruning. Currently, the ORC row group index uses min/max statistics to eliminate row groups (and stripes as well) that do not satisfy the predicate condition specified in the query. But in some cases, min/max based elimination is not effective (unsorted columns with a wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with an IN clause. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
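The sizes quoted in the comment above can be checked with the standard bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits. The following is a minimal sketch of that arithmetic only; the class and method names are hypothetical, and whether the patch sizes its filters with exactly this formula is an assumption.

```java
// Hypothetical helper, not part of Hive or ORC: reproduces the sizing
// arithmetic from the comment with the textbook formula
// m = -n * ln(p) / (ln 2)^2 bits.
class BloomFilterSizing {

  /** Optimal number of bits for n expected items at false positive rate p. */
  static long optimalBits(long n, double p) {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  /** Same figure rounded up to whole bytes. */
  static long optimalBytes(long n, double p) {
    return (optimalBits(n, p) + 7) / 8;
  }

  public static void main(String[] args) {
    // One row group at the default stride of 10,000 rows.
    System.out.println("row group: " + optimalBytes(10_000L, 0.05) + " bytes");
    // A whole file with 100 million values in one column.
    System.out.println("file: " + optimalBytes(100_000_000L, 0.05) + " bytes");
  }
}
```

With p = 0.05 this gives roughly 7.8 thousand bytes for 10,000 items and a little over 74 MiB for 100 million items, which is in line with the figures in the comment. The filter grows linearly with n, which is why a file-level filter is so much larger than a row-group one.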
[jira] [Commented] (HIVE-4639) Add has null flag to ORC internal index
[ https://issues.apache.org/jira/browse/HIVE-4639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268053#comment-14268053 ] Owen O'Malley commented on HIVE-4639: - You should encode four values: no_values, all_nulls, some_nulls, no_nulls. This will allow you to support a richer set of sargs. Add has null flag to ORC internal index --- Key: HIVE-4639 URL: https://issues.apache.org/jira/browse/HIVE-4639 Project: Hive Issue Type: Improvement Components: File Formats Reporter: Owen O'Malley Assignee: Prasanth Jayachandran Attachments: HIVE-4639.1.patch It would enable more predicate pushdown if we added a flag to the index entry recording whether there were any null values in the column for the 10k rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
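To illustrate why four values support more predicates than a single has-null flag, here is a minimal sketch. The enum, the counters, and the skip rules are hypothetical illustrations of the suggestion in the comment, not the encoding used by the attached patch.

```java
// Hypothetical sketch: a four-valued null state per index entry and the
// row-group pruning decisions it allows. Not the actual ORC encoding.
class NullStateSketch {

  enum NullState { NO_VALUES, ALL_NULLS, SOME_NULLS, NO_NULLS }

  /** Derive the state from the row count and null count of one index entry. */
  static NullState classify(long rowCount, long nullCount) {
    if (rowCount == 0) {
      return NullState.NO_VALUES;
    }
    if (nullCount == rowCount) {
      return NullState.ALL_NULLS;
    }
    return nullCount == 0 ? NullState.NO_NULLS : NullState.SOME_NULLS;
  }

  /** "col IS NULL" cannot match when the entry holds no nulls at all. */
  static boolean canSkipForIsNull(NullState s) {
    return s == NullState.NO_VALUES || s == NullState.NO_NULLS;
  }

  /** "col IS NOT NULL" cannot match when every value is null. */
  static boolean canSkipForIsNotNull(NullState s) {
    return s == NullState.NO_VALUES || s == NullState.ALL_NULLS;
  }

  public static void main(String[] args) {
    NullState s = classify(10_000, 10_000);
    System.out.println(s + " skip IS NOT NULL: " + canSkipForIsNotNull(s));
  }
}
```

A plain boolean flag could answer only the IS NULL case; the all_nulls state is what lets the reader also skip a row group for IS NOT NULL.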
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267993#comment-14267993 ] Owen O'Malley commented on HIVE-9188: - I'm concerned about the size of the bloom filters and making them an integrated part of the column statistics. I think we'd do much better to make a BLOOM_FILTER stream kind and place them in a completely separate stream. That would allow the predicate push down to only load the bloom filters for the columns that it needs. BloomFilter in ORC row group index -- Key: HIVE-9188 URL: https://issues.apache.org/jira/browse/HIVE-9188 Project: Hive Issue Type: New Feature Components: File Formats Affects Versions: 0.15.0 Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Labels: orcfile Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch Bloom filters are a well-known probabilistic data structure for set membership checking. We can use bloom filters in the ORC index for better row group pruning. Currently, the ORC row group index uses min/max statistics to eliminate row groups (and stripes as well) that do not satisfy the predicate condition specified in the query. But in some cases, min/max based elimination is not effective (unsorted columns with a wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with an IN clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9166) Place an upper bound for SARG CNF conversion
[ https://issues.apache.org/jira/browse/HIVE-9166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252406#comment-14252406 ] Owen O'Malley commented on HIVE-9166: - +1 LGTM. You should probably add a test case where there is something other than the large CNF, something like (and leaf-1 (or ...)). You should end up with leaf-1 as your final expression. Place an upper bound for SARG CNF conversion Key: HIVE-9166 URL: https://issues.apache.org/jira/browse/HIVE-9166 Project: Hive Issue Type: Bug Affects Versions: 0.14.0, 0.15.0 Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Labels: orcfile Attachments: HIVE-9166.1.patch, HIVE-9166.2.patch SARG creation in ORC applies several optimizations to the expression tree. Among them, CNF conversion is an exponential algorithm, as it finds all combinations of expressions when converting from OR of AND form to AND of OR form (CNF). We need an upper bound for this algorithm to prevent it from running for a long time and generating a huge list of combinations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
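The blow-up described in the issue can be made concrete: distributing an OR over k AND groups produces one clause per combination, so the clause count is the product of the group sizes. The sketch below only counts clauses and checks them against a limit. The class, the method names, and the limit of 256 are hypothetical, and it does not model what the patch does with an expression that exceeds the bound.

```java
// Hypothetical sketch: estimate the size of the CNF produced from an
// OR of AND groups before attempting the conversion.
class CnfBoundSketch {

  // Illustrative limit only; the value used by the patch is not stated here.
  static final long MAX_COMBINATIONS = 256;

  /**
   * Number of OR clauses produced when (A1 and A2 ...) or (B1 and B2 ...) or ...
   * is rewritten as an AND of ORs. Stops early and returns limit + 1 once the
   * running product passes the limit, so the count itself cannot overflow.
   */
  static long clauseCount(int[] andGroupSizes, long limit) {
    long count = 1;
    for (int size : andGroupSizes) {
      count *= size;
      if (count > limit) {
        return limit + 1;
      }
    }
    return count;
  }

  /** True when the conversion stays within the bound. */
  static boolean withinBound(int[] andGroupSizes) {
    return clauseCount(andGroupSizes, MAX_COMBINATIONS) <= MAX_COMBINATIONS;
  }

  public static void main(String[] args) {
    int[] small = {2, 2, 2};                      // 8 clauses
    int[] large = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; // 1024 clauses
    System.out.println(withinBound(small) + " " + withinBound(large));
  }
}
```

Ten two-way AND groups already yield 1024 clauses, so checking the product before expanding avoids both the running time and the memory for the combinations list.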
[jira] [Commented] (HIVE-8966) Delta files created by hive hcatalog streaming cannot be compacted
[ https://issues.apache.org/jira/browse/HIVE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14240415#comment-14240415 ] Owen O'Malley commented on HIVE-8966: - Alan, your patch looks good. +1 Delta files created by hive hcatalog streaming cannot be compacted -- Key: HIVE-8966 URL: https://issues.apache.org/jira/browse/HIVE-8966 Project: Hive Issue Type: Bug Components: HCatalog Affects Versions: 0.14.0 Environment: hive Reporter: Jihong Liu Assignee: Alan Gates Priority: Critical Fix For: 0.14.1 Attachments: HIVE-8966.2.patch, HIVE-8966.patch Hive HCatalog streaming also creates a file like bucket_n_flush_length in each delta directory, where n is the bucket number. compactor.CompactorMR thinks this file also needs to be compacted. However, this file cannot be compacted, so compactor.CompactorMR does not continue with the compaction. In a test, after the bucket_n_flush_length file was removed, the alter table partition compact finished successfully. If that file is not deleted, nothing is compacted. This is probably a very severe bug. Both 0.13 and 0.14 have this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8880) non-synchronized access to split list in OrcInputFormat
[ https://issues.apache.org/jira/browse/HIVE-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14236177#comment-14236177 ] Owen O'Malley commented on HIVE-8880: - +1, this is good. non-synchronized access to split list in OrcInputFormat --- Key: HIVE-8880 URL: https://issues.apache.org/jira/browse/HIVE-8880 Project: Hive Issue Type: Bug Affects Versions: 0.14.0 Reporter: Alan Gates Assignee: Alan Gates Fix For: 0.14.1 Attachments: HIVE-8880.patch When adding delta files to the list of ORC splits, access to the list is not synchronized even though it is shared across threads. All other additions to the list are synchronized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
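The pattern behind this fix can be sketched as follows: every addition to a list shared across threads goes through the same lock. This is a generic illustration with hypothetical class and method names; it is not the code in OrcInputFormat or in the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: several threads append to one shared list, and
// every access takes the same monitor so no addition is lost.
class SplitListSketch {

  private final List<String> splits = new ArrayList<>();

  /** All writers must use this path; an unsynchronized add would race. */
  void addSplit(String split) {
    synchronized (splits) {
      splits.add(split);
    }
  }

  int size() {
    synchronized (splits) {
      return splits.size();
    }
  }

  /** Start the given number of threads, each adding perThread entries. */
  static int fill(int threads, int perThread) {
    SplitListSketch holder = new SplitListSketch();
    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      final int id = t;
      workers[t] = new Thread(() -> {
        for (int i = 0; i < perThread; i++) {
          holder.addSplit("delta-" + id + "-" + i);
        }
      });
      workers[t].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return holder.size();
  }

  public static void main(String[] args) {
    System.out.println(fill(4, 1000));
  }
}
```

Without the synchronized block, concurrent calls to ArrayList.add can lose entries or throw, which is why a single unsynchronized code path is enough to cause the bug.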
[jira] [Commented] (HIVE-8732) ORC string statistics are not merged correctly
[ https://issues.apache.org/jira/browse/HIVE-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202126#comment-14202126 ] Owen O'Malley commented on HIVE-8732: - I should also point out that I added a line about the version to the orcfiledump output. New files will get the line "File Version: 0.12 with HIVE_8732". Files written by the old writer will say either "File Version: 0.12 with ORIGINAL" or "File Version: 0.11 with ORIGINAL". ORC string statistics are not merged correctly -- Key: HIVE-8732 URL: https://issues.apache.org/jira/browse/HIVE-8732 Project: Hive Issue Type: Bug Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley Priority: Blocker Fix For: 0.14.0 Attachments: HIVE-8732.patch, HIVE-8732.patch, HIVE-8732.patch Currently ORC's string statistics do not merge correctly causing incorrect maximum values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8732) ORC string statistics are not merged correctly
[ https://issues.apache.org/jira/browse/HIVE-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-8732: Attachment: HIVE-8732.patch I had to fix some minor problems and update a bunch of qfile tests because the ORC files are now 2 bytes longer. ORC string statistics are not merged correctly -- Key: HIVE-8732 URL: https://issues.apache.org/jira/browse/HIVE-8732 Project: Hive Issue Type: Bug Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley Priority: Blocker Fix For: 0.14.0 Attachments: HIVE-8732.patch, HIVE-8732.patch, HIVE-8732.patch Currently ORC's string statistics do not merge correctly causing incorrect maximum values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-8746) ORC timestamp columns are sensitive to daylight savings time
Owen O'Malley created HIVE-8746: --- Summary: ORC timestamp columns are sensitive to daylight savings time Key: HIVE-8746 URL: https://issues.apache.org/jira/browse/HIVE-8746 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Hive uses Java's Timestamp class to manipulate timestamp columns. Unfortunately, the textual parsing in Timestamp is done in local time and the internal storage is in UTC. ORC mostly sidesteps this issue by computing the difference between the time and a base time, both in local time, and storing that difference in the file. Reading the file across timezones will mostly work correctly: 2014-01-01 12:34:56 will read correctly in every timezone. However, moving between timezones with different daylight saving rules creates trouble. In particular, a file written on a computer in PST and read in UTC will read 2014-06-06 12:34:56 as 2014-06-06 11:34:56. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
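The one-hour shift described above can be reproduced with a small model of the scheme: store the elapsed seconds between a base time and the value, both interpreted in the writer's zone, then add that offset to the same base time interpreted in the reader's zone. This is a sketch using java.time rather than Hive's code; the class name, the method names, and the base time of 2015-01-01 00:00:00 are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical model of "store the difference from a local base time".
class OrcTimestampDstSketch {

  // Assumed base time; it is interpreted in whichever zone the code runs in.
  static final LocalDateTime BASE = LocalDateTime.of(2015, 1, 1, 0, 0, 0);

  /** Writer side: elapsed seconds between base and value, both in writerZone. */
  static long write(LocalDateTime value, ZoneId writerZone) {
    return Duration.between(BASE.atZone(writerZone), value.atZone(writerZone)).getSeconds();
  }

  /** Reader side: apply the stored offset to the base in readerZone. */
  static LocalDateTime read(long storedSeconds, ZoneId readerZone) {
    return BASE.atZone(readerZone).plusSeconds(storedSeconds).toLocalDateTime();
  }

  /** Convenience wrapper working on ISO strings such as 2014-06-06T12:34:56. */
  static String roundTrip(String local, String writerZone, String readerZone) {
    long stored = write(LocalDateTime.parse(local), ZoneId.of(writerZone));
    return read(stored, ZoneId.of(readerZone)).toString();
  }

  public static void main(String[] args) {
    // Winter value: base and value share the same UTC offset, so it survives.
    System.out.println(roundTrip("2014-01-01T12:34:56", "America/Los_Angeles", "UTC"));
    // Summer value: the writer was on daylight time, so one hour is lost.
    System.out.println(roundTrip("2014-06-06T12:34:56", "America/Los_Angeles", "UTC"));
  }
}
```

In this model the winter timestamp reads back unchanged, while the June timestamp comes back as 11:34:56. The base time falls in standard time, so a value written during daylight time carries an elapsed-seconds offset that is one hour short of its wall-clock distance from the base.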
[jira] [Commented] (HIVE-8732) ORC string statistics are not merged correctly
[ https://issues.apache.org/jira/browse/HIVE-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199199#comment-14199199 ] Owen O'Malley commented on HIVE-8732: - I've created the timestamp bug as HIVE-8746. The fix for that one is pretty touchy and I'll do it in 0.15 I think rather than risk the 0.14 release. I don't want to create a new write format since the old reader will read the corrected files. I will add a flag that I can use to suppress using the split elimination code for files with broken stripe/file indexes. Does that sound reasonable? ORC string statistics are not merged correctly -- Key: HIVE-8732 URL: https://issues.apache.org/jira/browse/HIVE-8732 Project: Hive Issue Type: Bug Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley Priority: Blocker Fix For: 0.14.0 Attachments: HIVE-8732.patch Currently ORC's string statistics do not merge correctly causing incorrect maximum values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)