[ https://issues.apache.org/jira/browse/HIVE-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mayank Lahiri updated HIVE-1438:
--------------------------------

    Attachment: HIVE-1438.1.patch

> sentences() UDF for natural language tokenization
> -------------------------------------------------
>
>                 Key: HIVE-1438
>                 URL: https://issues.apache.org/jira/browse/HIVE-1438
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1438.1.patch
>
>
> Create a generic UDF that tokenizes free-form natural language text into
> sentences and words for more advanced processing, while stripping unnecessary
> punctuation and being fully international-aware. Fortunately, most of this
> functionality is already built into Java in the form of the i18n BreakIterator
> class, so this UDF will just connect it to Hive. For example:
>
> SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
> [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]
>
> or
>
> SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
> [["Je","m'apelle","hive"]]
>
> Notice how punctuation is maintained only where appropriate. Breaking at
> sentences (and thus the nested array return type) is important for tasks like
> counting the frequency of n-grams in text, which should not cross sentence
> boundaries.
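
For readers unfamiliar with BreakIterator, the tokenization described above can be sketched in plain Java roughly as follows. This is a minimal illustration and not the code in HIVE-1438.1.patch: the class name SentenceTokenizerSketch, the helper words(), and the rule "keep a token only if it starts with a letter or digit" are all assumptions made for the example, and the Hive GenericUDF wiring is left out entirely.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: not the UDF from the attached patch.
class SentenceTokenizerSketch {

    // Splits text into sentences, then each sentence into words.
    // Sentences that contain no words are skipped.
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            List<String> words = words(text.substring(start, end), locale);
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Walks word boundaries and drops segments that are only
    // whitespace or punctuation (assumed filtering rule).
    private static List<String> words(String sentence, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);
        wordIt.setText(sentence);
        int start = wordIt.first();
        for (int end = wordIt.next(); end != BreakIterator.DONE;
                start = end, end = wordIt.next()) {
            String token = sentence.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [[Hello, there], [This, is, a, UDF]]
        System.out.println(sentences("Hello there! This is a UDF.", Locale.ENGLISH));
    }
}
```

The nested list mirrors the array-of-arrays return type proposed in the issue: because each inner list stops at a sentence boundary, an n-gram counter that works one inner list at a time cannot produce n-grams that span two sentences.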