[
https://issues.apache.org/jira/browse/HIVE-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829976#comment-13829976
]
Lefty Leverenz edited comment on HIVE-5871 at 9/10/14 4:47 AM:
---
This implementation mainly relies on LazySimpleSerDe for serialization and
deserialization. I added some methods to LazyStruct to parse a row delimited by
a multiple-character string. Another difference from LazySimpleSerDe is that
MultiDelimitSerDe doesn't use Base64 to encode binary fields in serialization,
because the encoded string may interfere with the delimiter. I also modified
LazyBinary so that when it deserializes a binary field and is unable to
Base64-decode the field, it just keeps the data unchanged. A simple use case is
as follows:
create table test (id string,hivearray array<binary>,hivemap map<string,int>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH
SERDEPROPERTIES
("field.delimited"="[,]","collection.delimited"=":","mapkey.delimited"="@");
where field.delimited is the multiple-char field delimiter,
collection.delimited is the delimiter for collection items, and
mapkey.delimited is the delimiter between keys and values in maps. We currently
don't support multiple-char values for the latter two delimiters.
Edited 10/Sep/14 on behalf of Rui Li: This comment's example differs from the
final version of the patch. See the issue description for an accurate example,
and note that the SERDEPROPERTIES are *.delim rather than *.delimited.
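
The Base64 fallback described in this comment can be illustrated with a small
Python sketch. This is not the patch's Java code; the function name
decode_binary_field is hypothetical, and the sketch only assumes the behaviour
stated above: try to Base64-decode the field, and keep the bytes unchanged if
that fails.

```python
import base64
import binascii


def decode_binary_field(raw: bytes) -> bytes:
    """Return the Base64-decoded bytes, or the input unchanged if decoding fails.

    Hypothetical helper for illustration; it mirrors the fallback described in
    the comment, not the actual LazyBinary implementation.
    """
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently discarding them.
        return base64.b64decode(raw, validate=True)
    except (binascii.Error, ValueError):
        # Not valid Base64: treat the field as raw binary data.
        return raw


print(decode_binary_field(b"aGVsbG8="))  # a valid Base64 string is decoded
print(decode_binary_field(b"[,]raw"))    # anything else is returned as-is
```

One limitation of a fallback like this is that raw data which happens to be
valid Base64 is decoded anyway, so the two cases cannot always be told apart.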
Use multiple-characters as field delimiter
--
Key: HIVE-5871
URL: https://issues.apache.org/jira/browse/HIVE-5871
Project: Hive
Issue Type: Improvement
Components: Contrib
Affects Versions: 0.12.0
Reporter: Rui Li
Assignee: Rui Li
Labels: TODOC14
Fix For: 0.14.0
Attachments: HIVE-5871.2.patch, HIVE-5871.3.patch, HIVE-5871.4.patch,
HIVE-5871.5.patch, HIVE-5871.6.patch, HIVE-5871.patch
By default, Hive only allows users to use a single character as the field
delimiter. Although RegexSerDe can be used to specify a multiple-character
delimiter, it can be daunting to use, especially for amateurs.
The patch adds a new SerDe named MultiDelimitSerDe. With MultiDelimitSerDe,
users can specify a multiple-character field delimiter when creating tables,
in a way that closely resembles a typical table creation. For example:
{code}
create table test (id string,hivearray array<binary>,hivemap map<string,int>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES
("field.delim"="[,]","collection.delim"=":","mapkey.delim"="@");
{code}
where {{field.delim}} is the field delimiter, and {{collection.delim}} and
{{mapkey.delim}} are the delimiters for collection items and key-value pairs,
respectively. Among these delimiters, {{field.delim}} is mandatory and can
consist of multiple characters, while {{collection.delim}} and
{{mapkey.delim}} are optional and only support a single character.
To use MultiDelimitSerDe, you have to add the hive-contrib jar to the class
path, e.g. with the {{add jar}} command.
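
As a rough illustration of the delimiter roles described above, the following
Python sketch splits one text row for the example table. It is not the SerDe's
implementation: the function parse_row and the sample row are hypothetical, the
default values for the collection and map-key delimiters are taken from the
example rather than from Hive's defaults, and the array items are kept as plain
strings for simplicity. It assumes that map entries are separated by the
collection delimiter, as in Hive's delimited text format.

```python
def parse_row(line, field_delim, collection_delim=":", mapkey_delim="@"):
    """Split one text row into (id, array items, map) for the example schema.

    Hypothetical helper for illustration only.
    """
    # Only the field delimiter may span several characters.
    if len(collection_delim) != 1 or len(mapkey_delim) != 1:
        raise ValueError(
            "collection and map-key delimiters must be single characters"
        )

    # The example table has exactly three columns: id, hivearray, hivemap.
    id_field, array_field, map_field = line.split(field_delim)

    # Collection items are separated by the collection delimiter.
    items = array_field.split(collection_delim) if array_field else []

    # Map entries are also separated by the collection delimiter; within each
    # entry the map-key delimiter separates the key from the value.
    mapping = {}
    if map_field:
        for entry in map_field.split(collection_delim):
            key, value = entry.split(mapkey_delim, 1)
            mapping[key] = int(value)

    return id_field, items, mapping


# A made-up row using "[,]" as the field delimiter.
print(parse_row("1[,]a:b[,]k1@10:k2@20", "[,]"))
```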
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)