Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Ryan Compton
FWIW, if you do decide to handle language detection on your own machine,
this library works great on tweets: https://github.com/carrotsearch/langid-java

On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer  wrote:

> Hi,
>
> On Wed, Nov 12, 2014 at 5:42 AM, SK  wrote:
>>
>> But getLang() is one of the methods of twitter4j.Status since version 3.0.6
>> according to the doc at:
>> http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--
>>
>> What version of twitter4j does Spark Streaming use?
>>
>
> 3.0.3
> https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53
>
> Tobias
>
>


Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tobias Pfeiffer
Hi,

On Wed, Nov 12, 2014 at 5:42 AM, SK  wrote:
>
> But getLang() is one of the methods of twitter4j.Status since version 3.0.6
> according to the doc at:
> http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--
>
> What version of twitter4j does Spark Streaming use?
>

3.0.3
https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53
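
One possible workaround, untested here and an assumption rather than anything confirmed in this thread: ask the build to resolve a newer twitter4j than the 3.0.3 declared by Spark's Twitter module, so that Status.getLang() is available at compile time. A minimal sbt sketch, assuming the org.twitter4j artifacts twitter4j-core and twitter4j-stream at version 3.0.6. Whether Spark's Twitter receiver behaves correctly against the newer jars would still need to be checked for your Spark release.

```scala
// build.sbt fragment (sketch, not verified against any particular Spark release).
// Assumption: forcing a newer twitter4j than the one spark-streaming-twitter
// declares is binary compatible with the receiver code in your Spark version.
dependencyOverrides += "org.twitter4j" % "twitter4j-core"   % "3.0.6"
dependencyOverrides += "org.twitter4j" % "twitter4j-stream" % "3.0.6"
```

If the override turns out not to be compatible, detecting the language from the tweet text itself avoids the dependency question entirely.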

Tobias


Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK

Small typo in my code in the previous post. That should be:

   tweets.filter(_.getLang() == "en")



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18622.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Thanks for the response. I tried the following:

   tweets.filter(_.getLang()="en")

I get a compilation error:
   value getLang is not a member of twitter4j.Status

But getLang() is one of the methods of twitter4j.Status since version 3.0.6
according to the doc at:
   http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--

What version of twitter4j does Spark Streaming use?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18621.html



Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tathagata Das
You could get all the tweets in the stream and then apply the "filter"
transformation to the DStream of tweets to drop the non-English ones.
The tweets in the DStream are of type twitter4j.Status, which has a
field describing the language. You can use that in the filter.

In practice, though, Twitter also marks a lot of non-English tweets as
English. To really filter out ALL non-English tweets, you will probably
have to apply some machine learning to identify English tweets.
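
The two steps described above (filter on the language tag first, then check the text itself) can be sketched roughly as follows. This is only an illustration under stated assumptions: `Tweet` is a hypothetical stand-in for twitter4j.Status so that the snippet runs without Spark or twitter4j, and the stop-word count is a deliberately crude placeholder for a real language-identification model, not a recommendation.

```scala
// Hypothetical stand-in for twitter4j.Status; both field names are assumptions.
final case class Tweet(lang: String, text: String)

object EnglishFilter {
  // A tiny hand-picked list of common English words (an assumption, not a trained model).
  private val commonEnglish: Set[String] =
    Set("the", "and", "is", "are", "to", "of", "in", "for", "you", "this", "that", "with")

  // Step 1: trust the language tag that arrives with the tweet.
  def taggedEnglish(t: Tweet): Boolean = t.lang == "en"

  // Step 2: crude content check; require at least `minHits` common English words.
  def looksEnglish(text: String, minHits: Int = 2): Boolean = {
    val words = text.toLowerCase.split("[^a-z']+").filter(_.nonEmpty)
    words.count(commonEnglish.contains) >= minHits
  }

  // Keep a tweet only if both checks agree.
  def keep(t: Tweet): Boolean = taggedEnglish(t) && looksEnglish(t.text)
}
```

On a plain collection this is used as `tweets.filter(EnglishFilter.keep)`. The same kind of predicate could be passed to the DStream's filter, with the stand-in fields replaced by whatever accessors the twitter4j version in use actually provides.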

On Tue, Nov 11, 2014 at 11:41 AM, SK  wrote:
> Hi,
>
> Is there a way to extract only the English language tweets when using
> TwitterUtils.createStream()? The "filters" argument specifies the strings
> that need to be contained in the tweets, but I am not sure how this can be
> used to specify the language.
>
> thanks
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614.html
