Read the documentation about TokenStream and how to consume one correctly. The same problem, affecting StandardTokenizer, was explained on this list a few days ago, too.
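In short: a TokenStream must be reset() before the first call to incrementToken(), and end()/close() must be called afterwards. Skipping reset() leaves the tokenizer's internal reader unset, which is the usual cause of the NullPointerException you see in zzRefill(). Below is a minimal sketch of the corrected method, assuming the same Lucene 4.x-era API your code already uses (imports added for completeness):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;
import org.apache.lucene.document.Document;

static List<String> getCategories(Document document) throws IOException {
    List<String> categories = new ArrayList<String>();
    String text = document.get("text");
    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
    CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
    tf.reset();  // mandatory: puts the stream into a consumable state
    while (tf.incrementToken()) {
        // keep only tokens the tokenizer typed as categories
        if (WikipediaTokenizer.CATEGORY.equals(typeAtt.type())) {
            categories.add(termAtt.toString());
        }
    }
    tf.end();    // records the end-of-stream state (e.g. final offset)
    tf.close();  // releases the underlying reader
    return categories;
}

The same reset()/end()/close() contract applies to StandardTokenizer and every other TokenStream, which is why the earlier thread hit the identical failure.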
Sashidhar Guntury <sashidhar.mo...@gmail.com> wrote:

>hi
>
>I'm using Lucene to query a wiki dump and get the categories out. So, I
>get the relevant documents and, for every document, I call the function
>below.
>
>static List<String> getCategories(Document document) throws IOException
>{
>    List<String> categories = new ArrayList<String>();
>    String text = document.get("text");
>    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
>    CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
>    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
>    while (tf.incrementToken())
>    {
>        String tokText = termAtt.toString();
>        if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY) == true)
>        {
>            categories.add(tokText);
>        }
>    }
>    return categories;
>}
>
>but it throws the following error (at the while statement):
>
>Exception in thread "main" java.lang.NullPointerException
>    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
>    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
>    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
>    at SearchIndex.getCategories(SearchIndex.java:82)
>    at SearchIndex.main(SearchIndex.java:54)
>
>I looked at the zzRefill() function, but I'm not able to understand it.
>Is this a known bug or something? I don't know what I'm doing wrong. The
>Lucene guys say that the whole WikipediaTokenizer module is in beta and
>may be subject to change. I was hoping someone could help me.
>
>thanks
>sashidhar

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de