Re: filtering out non English tweets using TwitterUtils
FWIW, if you do decide to handle language detection on your own machine, this library works great on tweets:
https://github.com/carrotsearch/langid-java

On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer wrote:
> Hi,
>
> On Wed, Nov 12, 2014 at 5:42 AM, SK wrote:
>> But getLang() is one of the methods of twitter4j.Status since version 3.0.6
>> according to the doc at:
>> http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--
>>
>> What version of twitter4j does Spark Streaming use?
>
> 3.0.3
> https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53
>
> Tobias
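To make the idea of local language detection concrete, here is a deliberately naive sketch. It is a stopword-ratio heuristic, not the API of langid-java and not a trained model. The object name, the stopword list, and the 0.2 threshold are all assumptions chosen for illustration, and a real detector would be far more accurate on short tweets.

```scala
object NaiveEnglishGuess {
  // Tiny hand-picked stopword list (an assumption, for illustration only).
  private val stopwords: Set[String] = Set(
    "the", "and", "is", "are", "to", "of", "in", "it",
    "you", "that", "this", "for", "on", "with")

  /** Fraction of ASCII word tokens that are common English stopwords. */
  def stopwordRatio(text: String): Double = {
    val tokens = text.toLowerCase(java.util.Locale.ROOT)
      .split("[^a-z']+")
      .filter(_.nonEmpty)
    if (tokens.isEmpty) 0.0
    else tokens.count(stopwords.contains).toDouble / tokens.length
  }

  /** Crude guess: treat the text as English if enough tokens are stopwords. */
  def looksEnglish(text: String, threshold: Double = 0.2): Boolean =
    stopwordRatio(text) >= threshold
}
```

A predicate like this could be applied to the tweet text in a DStream filter, for example `tweets.filter(s => NaiveEnglishGuess.looksEnglish(s.getText))`, with a proper language-identification library swapped in for the heuristic.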
Re: filtering out non English tweets using TwitterUtils
Hi,

On Wed, Nov 12, 2014 at 5:42 AM, SK wrote:
> But getLang() is one of the methods of twitter4j.Status since version 3.0.6
> according to the doc at:
> http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--
>
> What version of twitter4j does Spark Streaming use?

3.0.3
https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53

Tobias
Re: filtering out non English tweets using TwitterUtils
Small typo in my code in the previous post. That should be:

tweets.filter(_.getLang()=="en")

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18622.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: filtering out non English tweets using TwitterUtils
Thanks for the response. I tried the following:

tweets.filter(_.getLang()="en")

I get a compilation error:

value getLang is not a member of twitter4j.Status

But getLang() is one of the methods of twitter4j.Status since version 3.0.6 according to the doc at:
http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--

What version of twitter4j does Spark Streaming use?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18621.html
Re: filtering out non English tweets using TwitterUtils
You could get all the tweets in the stream and then apply a "filter" transformation on the DStream of tweets to filter out non-English tweets. The tweets in the DStream are of type twitter4j.Status, which has a field describing the language. You can use that in the filter.

In practice, though, a lot of non-English tweets are also marked as English by Twitter. To really filter out ALL non-English tweets, you will probably have to do some machine learning to identify English tweets.

On Tue, Nov 11, 2014 at 11:41 AM, SK wrote:
> Hi,
>
> Is there a way to extract only the English language tweets when using
> TwitterUtils.createStream()? The "filters" argument specifies the strings
> that need to be contained in the tweets, but I am not sure how this can be
> used to specify the language.
>
> thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614.html
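The filter described in this reply can be sketched as a small predicate on the language tag. This is only a sketch: the names `TweetLanguage` and `isEnglishTag` are hypothetical, and accepting regional tags such as "en-GB" is an assumption on my part. As noted elsewhere in this thread, `getLang()` on `twitter4j.Status` needs twitter4j 3.0.6 or later, so the usage line in the comment will not compile against the 3.0.3 version that Spark Streaming pulled in at the time.

```scala
object TweetLanguage {
  /** True when a language tag such as "en" or "en-GB" denotes English.
    * Null-safe, because a Java getter may return null. */
  def isEnglishTag(tag: String): Boolean =
    Option(tag)
      .map(_.trim.toLowerCase(java.util.Locale.ROOT))
      .exists(t => t == "en" || t.startsWith("en-"))
}

// Hypothetical use on the DStream returned by TwitterUtils.createStream
// (assumes a twitter4j version whose Status exposes getLang):
//   val english = tweets.filter(s => TweetLanguage.isEnglishTag(s.getLang))
```

Keeping the comparison in a plain function makes it easy to unit-test without a running stream, and it avoids a NullPointerException when a tweet carries no language tag.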