Read the documentation about TokenStream and how to consume one correctly. The same problem, affecting StandardTokenizer, was explained on this list a few days ago, too.
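In short: a TokenStream must be reset() before the first call to incrementToken(), and end()/close() must be called afterwards. Skipping reset() leaves the tokenizer's internal reader unset, which is the usual cause of the NullPointerException you see in zzRefill(). Below is a minimal sketch of the corrected method, assuming the same Lucene 4.x-era API your code already uses (imports added for completeness):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;
import org.apache.lucene.document.Document;

static List<String> getCategories(Document document) throws IOException {
    List<String> categories = new ArrayList<String>();
    String text = document.get("text");
    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
    CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
    tf.reset();  // mandatory: puts the stream into a consumable state
    while (tf.incrementToken()) {
        // keep only tokens the tokenizer typed as categories
        if (WikipediaTokenizer.CATEGORY.equals(typeAtt.type())) {
            categories.add(termAtt.toString());
        }
    }
    tf.end();    // records the end-of-stream state (e.g. final offset)
    tf.close();  // releases the underlying reader
    return categories;
}

The same reset()/end()/close() contract applies to StandardTokenizer and every other TokenStream, which is why the earlier thread hit the identical failure.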
Sashidhar Guntury <sashidhar.mo...@gmail.com> wrote:

>hi
>
>I'm using Lucene to query a wiki dump and get the categories out. So, I
>get the relevant documents and, for every document, I call the function
>below.
>
>static List<String> getCategories(Document document) throws IOException
>{
>    List<String> categories = new ArrayList<String>();
>    String text = document.get("text");
>    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
>    CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
>    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
>    while (tf.incrementToken())
>    {
>        String tokText = termAtt.toString();
>        if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY) == true)
>        {
>            categories.add(tokText);
>        }
>    }
>    return categories;
>}
>
>but it throws the following error (at the while statement):
>
>Exception in thread "main" java.lang.NullPointerException
>    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
>    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
>    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
>    at SearchIndex.getCategories(SearchIndex.java:82)
>    at SearchIndex.main(SearchIndex.java:54)
>
>I looked at the zzRefill() function, but I'm not able to understand it.
>Is this a known bug or something? I don't know what I'm doing wrong. The
>Lucene guys say that the whole WikipediaTokenizer module is in beta and
>may be subject to change. I was hoping someone could help me.
>
>thanks
>sashidhar

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de