[
https://issues.apache.org/jira/browse/HIVE-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Doubrovkine resolved HIVE-16351.
---------------------------------------
Resolution: Duplicate
Release Note:
Fixed it out. The problem is well described in
https://www.phdata.io/hive-corruption-due-to-newlines-and-carriage-returns. If
you set hive.query.result.fileformat to SequenceFile the query works as
expected since the query output is written in a binary format as opposed to a
text format.
Dup of https://issues.apache.org/jira/browse/HIVE-1608
> Hive confused by CR/LFs
> -----------------------
>
> Key: HIVE-16351
> URL: https://issues.apache.org/jira/browse/HIVE-16351
> Project: Hive
> Issue Type: Bug
> Components: Hive, Serializers/Deserializers
> Affects Versions: 1.2.1
> Environment: Hadoop 2.7.3
> Reporter: Daniel Doubrovkine
> Attachments: Screen Shot 2017-04-03 at 4.33.23 PM.png
>
>
> Hive is returning broken data that contains CR/LF.
> {code}
> CREATE DATABASE positron;
> CREATE EXTERNAL TABLE positron.articles (
> `_id` struct<oid:string>,
> `channel_id` struct<oid:string>,
> `exclude_google_news` boolean,
> `fair_ids` array<map<string,string>>,
> `hero_section` map<string,string>,
> `partner_ids` array<map<string,string>>,
> `description` string,
> `partner_channel_id` struct<oid:string>,
> `published` boolean,
> `published_at` map<string,string>,
> `slugs` array<string>,
> `sections` array<map<string,string>>,
> `thumbnail_image` string,
> `title` string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> WITH serdeproperties ('mapping.oid' = '$oid')
> LOCATION '/user/data/positron/articles';
> SELECT
> a.slugs[SIZE(a.slugs) - 1],
> a.title,
> a.thumbnail_image
> FROM
> positron.articles as a
> WHERE a.published = true
> AND a.hero_section["type"] = "video"
> AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL)
> AND a.thumbnail_image IS NOT NULL;
> {code}
> Note the NULLs below. The `title` field has a CR/LF in it.
> {code}
> astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger Sneak peek
> "Realities" by Marc Gumpinger
> https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png
> artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market The Art
> Market, Explained: The Rise of the Art Fair
> https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg
> nolongercreek Alexandra Kehayoglou x Artsy: NULL
> No Longer Creek
> https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg NULL
> kukje-gallery-teaser-trailer-kim-yong-ik Teaser Trailer | Kim Yong-Ik
> https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
> {code}
> {code}
> $ hive --version
> Hive 1.1.0-cdh5.6.0
> Subversion
> file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty
> -r Unknown
> Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
> From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
> {code}
> What's very interesting is that I am not able to reproduce this locally with
> the same data with any version of hive. It only happens in our Cloudera
> cluster. Any help appreciated.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)