[
https://issues.apache.org/jira/browse/PIG-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641196#comment-13641196
]
MIYAKAWA Taku commented on PIG-3215:
------------------------------------
Jonathan, sorry for the delay. As far as I can see, you have not yet posted about
the issue at [email protected]. May I post on your behalf?
> [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
> ----------------------------------------------------------------------------
>
> Key: PIG-3215
> URL: https://issues.apache.org/jira/browse/PIG-3215
> Project: Pig
> Issue Type: New Feature
> Components: piggybank
> Reporter: MIYAKAWA Taku
> Assignee: MIYAKAWA Taku
> Labels: piggybank
> Attachments: LTSVLoader-6.html, LTSVLoader.html, PIG-3215-6.patch,
> PIG-3215.patch
>
>
> LTSV (Labeled Tab-separated Values) is a format that is becoming popular in
> Japan for log files, especially web server logs. The goal of this issue is to
> add LTSVLoader to PiggyBank to load LTSV files.
> LTSV is based on TSV, so columns are separated by tab characters.
> In addition, each column consists of a label and a value, separated by a ":"
> character.
> For more about LTSV, see http://ltsv.org/.
> h4. Example LTSV file (access.log)
> Columns are separated by tab characters.
> {noformat}
> host:host1.example.org req:GET /index.html ua:Opera/9.80
> host:host1.example.org req:GET /favicon.ico ua:Opera/9.80
> host:pc.example.com req:GET /news.html ua:Mozilla/5.0
> {noformat}
> h4. Usage 1: Extract fields from each line
> Users can specify an input schema and get columns as Pig fields.
> This example loads the LTSV file shown in the previous section.
> {code}
> -- Parses the access log and counts the number of lines
> -- for each pair of the host column and the ua column.
> access = LOAD 'access.log' USING
> org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
> grouped_access = GROUP access BY (host, ua);
> count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua,
> COUNT(access);
> DUMP count_for_host_ua;
> {code}
> The following text is printed.
> {noformat}
> (host1.example.org,Opera/9.80,2)
> (pc.example.com,Mozilla/5.0,1)
> {noformat}
> h4. Usage 2: Extract a map from each line
> Users can also get a map for each LTSV line. Each map key is the label of an
> LTSV column, and the corresponding value is the text after the ":" in that
> column.
> {code}
> -- Parses the access log and projects the user agent field.
> access = LOAD 'access.log' USING
> org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
> user_agent = FOREACH access GENERATE m#'ua' AS ua;
> DUMP user_agent;
> {code}
> The following text is printed.
> {noformat}
> (Opera/9.80)
> (Opera/9.80)
> (Mozilla/5.0)
> {noformat}
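For readers unfamiliar with the format, the label/value splitting described above can be sketched in a few lines of Python. This is only an illustration of the LTSV rules quoted in the description (columns separated by tabs, label and value separated by ":"). It is not the LTSVLoader code from the attached patch, and the helper name parse_ltsv_line is hypothetical. Skipping columns that have no ":" is an assumption made for the sketch.

```python
def parse_ltsv_line(line):
    """Split one LTSV line into a dict of label -> value.

    Hypothetical helper for illustration; not part of Pig or PiggyBank.
    """
    record = {}
    for column in line.rstrip("\r\n").split("\t"):
        # Split on the first ":" only, so a value may itself contain ":".
        label, sep, value = column.partition(":")
        if not sep:
            # Assumption: a column without ":" is treated as malformed and skipped.
            continue
        record[label] = value
    return record


# First line of the example access.log, with explicit tab characters.
line = "host:host1.example.org\treq:GET /index.html\tua:Opera/9.80"
record = parse_ltsv_line(line)
print(record["ua"])  # prints Opera/9.80
```

The resulting dict plays the same role as the map in Usage 2: record["ua"] corresponds to m#'ua'.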
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira