[
https://issues.apache.org/jira/browse/HIVE-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887436#action_12887436
]
John Sichi commented on HIVE-1438:
----------------------------------
For the test case, it's good that you have non-English text. However, I'm
worried that checking in non-ASCII files to Subversion may cause encoding
problems on some platforms (I've seen problems from this in the past). Let's
think of a way to avoid that while preserving the test coverage.
Looking at the grammar file (Hive.g), there may be a way to encode Unicode
characters as hex in a character string literal.
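
As a rough illustration of the hex-encoding idea, here is a minimal Java sketch that rewrites non-ASCII characters as backslash-u hex escapes, so test text could be kept in an ASCII-only file. The class and method names (UnicodeEscapeDemo, toAsciiEscapes) are hypothetical, and whether Hive's string literal grammar accepts this escape form is an assumption that would need to be checked against Hive.g.

```java
public class UnicodeEscapeDemo {

    // Hypothetical helper: replaces each non-ASCII UTF-16 code unit with a
    // backslash-u escape made of four hex digits, leaving ASCII untouched.
    static String toAsciiEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal below is itself written with an escape, so this source
        // file contains only ASCII bytes.
        String text = "Fran\u00e7ais";
        System.out.println(toAsciiEscapes(text));
    }
}
```

Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is how Java source represents them as well.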
> sentences() UDF for natural language tokenization
> -------------------------------------------------
>
> Key: HIVE-1438
> URL: https://issues.apache.org/jira/browse/HIVE-1438
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Query Processor
> Affects Versions: 0.7.0
> Reporter: Mayank Lahiri
> Assignee: Mayank Lahiri
> Fix For: 0.7.0
>
> Attachments: HIVE-1438.1.patch
>
>
> Create a generic UDF that tokenizes free-form natural language text into
> sentences and words for more advanced processing, while stripping unnecessary
> punctuation and being fully international-aware. Fortunately, most of this
> functionality is already built into Java in the form of the i18n BreakIterator
> class, so this UDF will just connect it to Hive. For example:
> > SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
> [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]
> or
> > SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
> [["Je","m'apelle","hive"]]
> Notice how punctuation is maintained only where appropriate. Breaking at
> sentences (and thus the nested array return type) is important for tasks like
> counting the frequency of n-grams in text, which should not cross sentence
> boundaries.
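
The approach described above can be sketched in plain Java roughly as follows. This is not the code in the attached patch; the class name SentencesSketch and the rule for dropping punctuation (keep only tokens that start with a letter or digit) are assumptions made for illustration. It uses only java.text.BreakIterator from the standard library.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentencesSketch {

    // Splits text into sentences, then each sentence into words, returning a
    // nested list. Tokens that do not start with a letter or digit
    // (whitespace, standalone punctuation) are dropped.
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentIt = BreakIterator.getSentenceInstance(locale);
        sentIt.setText(text);
        int sStart = sentIt.first();
        for (int sEnd = sentIt.next(); sEnd != BreakIterator.DONE;
                sStart = sEnd, sEnd = sentIt.next()) {
            String sentence = text.substring(sStart, sEnd);
            List<String> words = new ArrayList<>();
            BreakIterator wordIt = BreakIterator.getWordInstance(locale);
            wordIt.setText(sentence);
            int wStart = wordIt.first();
            for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                    wStart = wEnd, wEnd = wordIt.next()) {
                String token = sentence.substring(wStart, wEnd);
                if (Character.isLetterOrDigit(token.codePointAt(0))) {
                    words.add(token);
                }
            }
            // Skip sentences that contained no word tokens at all.
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello there! This is a UDF.", Locale.ENGLISH));
    }
}
```

Because words are collected per sentence, an n-gram counter built on top of this output can iterate over each inner list separately and never pair the last word of one sentence with the first word of the next.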
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.