[ 
https://issues.apache.org/jira/browse/STREAMS-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996564#comment-13996564
 ] 

Matthew Hager commented on STREAMS-79:
--------------------------------------

This is a very well-known problem. I would suggest looking into Lucene's 
library to extract these tokens. While this is very straightforward for a 
language like English, Spanish, or even Russian, it gets much more 
complicated when working with languages like Chinese, Japanese, and Hindi. 

Twitter had this exact same problem, used Lucene to solve it, and saw an 8x 
improvement in performance. I can point you to some examples if that would be 
helpful.

> RegEx Extractor Module
> ----------------------
>
>                 Key: STREAMS-79
>                 URL: https://issues.apache.org/jira/browse/STREAMS-79
>             Project: Streams
>          Issue Type: New Feature
>            Reporter: Matt Franklin
>
> Some data sources do not separate out shared links, hashtags and @mentions.  
> This module will use predefined regular expressions to parse the content of 
> an Activity object to extract these entities.
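
For illustration only, the regex-based approach described in the issue might
look roughly like the sketch below. The class name, method names, and the
patterns themselves are hypothetical and are not taken from the Streams
codebase; the sketch uses only java.util.regex with Unicode letter and digit
classes, and assumes @mention handles are ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Streams.
public class EntityExtractor {

    // '#' followed by Unicode letters, digits, or underscores, and not
    // preceded by a word character (so "abc#def" is ignored).
    private static final Pattern HASHTAG =
        Pattern.compile("(?<![\\p{L}\\p{N}_])#([\\p{L}\\p{N}_]+)");

    // '@' followed by an ASCII handle (an assumption), and not preceded by a
    // word character (so e-mail addresses are ignored).
    private static final Pattern MENTION =
        Pattern.compile("(?<![\\p{L}\\p{N}_])@([A-Za-z0-9_]+)");

    // Deliberately loose http/https matcher: runs up to the next whitespace,
    // angle bracket, or double quote.
    private static final Pattern URL =
        Pattern.compile("https?://[^\\s<>\"]+");

    // Collects the requested capture group from every match in the text.
    private static List<String> extract(Pattern pattern, String text, int group) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group(group));
        }
        return found;
    }

    public static List<String> hashtags(String text) {
        return extract(HASHTAG, text, 1);
    }

    public static List<String> mentions(String text) {
        return extract(MENTION, text, 1);
    }

    public static List<String> links(String text) {
        return extract(URL, text, 0);
    }
}
```

Patterns like these rely on whitespace or punctuation to end a token. In
text written without spaces between words, a hashtag match can run on into
the following words, which is the kind of complication the comment above
raises for languages such as Chinese and Japanese.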



--
This message was sent by Atlassian JIRA
(v6.2#6252)
