RE: Posting unicode data to lucene not working during searching/retreival!

Uwe Schindler Thu, 21 May 2009 00:12:48 -0700

Hello KK,

> Thanks for your quick response. Let me explain the whole thing.
> I'm downloading the pages for the given URLs, then extracting the text and
> converting it to UTF-8 this way:
> 
> byte [] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
> 
> Here textOnly is the text extracted from the downloaded page, and this is
> how I'm downloading the pages:
> private String downloadPage(URL pageUrl) {
>         try {
>             // Open connection to URL for reading.
>             BufferedReader reader =
>                     new BufferedReader(new InputStreamReader(
>                     pageUrl.openStream()));
> 
>             // Read page into buffer.
>             String line;
>             StringBuffer pageBuffer = new StringBuffer();
>             while ((line = reader.readLine()) != null) {
>                 pageBuffer.append(line);
>             }
> 
>             return pageBuffer.toString();
>         } catch (Exception e) {
>         }
> 
>         return null;
> }
> 
> Am I going wrong anywhere? Do I have to specify the charset when opening
> the BufferedReader?


You have to specify the charset when converting the InputStream to a Reader,
so pass it to the InputStreamReader constructor [new
InputStreamReader(InputStream, charset)]. If you do not, the constructor
uses the default charset of your platform, which may not be UTF-8!
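
As a rough illustration of the point above, here is a minimal sketch that
reads a stream with an explicit charset. The class name CharsetAwareDownload
and the helper readAll are made up for this example, and a
ByteArrayInputStream stands in for pageUrl.openStream() so that it runs
without a network connection. It assumes the page really is UTF-8 encoded.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original code.
public class CharsetAwareDownload {

    // Decodes the whole stream using the given charset instead of the
    // platform default. Each line is terminated with '\n'.
    static String readAll(InputStream in, Charset charset) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, charset));
        try {
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
            return page.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset utf8 = Charset.forName("UTF-8");
        // Two Devanagari code points; U+093F is the &#2367; seen in Luke.
        String original = "\u0915\u093F";
        byte[] bytes = original.getBytes(utf8); // stands in for the page bytes

        // Decoding with the charset the bytes were written in round-trips.
        String decoded = readAll(new ByteArrayInputStream(bytes), utf8);
        System.out.println(decoded.trim().equals(original)); // prints "true"

        // Decoding the same bytes with a different charset corrupts the text.
        String misread = readAll(new ByteArrayInputStream(bytes),
                Charset.forName("ISO-8859-1"));
        System.out.println(misread.trim().equals(original)); // prints "false"
    }
}
```

Once the text is a Java String it is already Unicode. Note that
textOnly.getBytes() without an argument encodes with the platform default
charset, so decoding those bytes as UTF-8 only round-trips when that default
happens to be UTF-8.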

...

> and for searcher this is the code:
> package solrSearch;
> 
> import java.io.FileReader;
> import org.stringtree.json.JSONWriter;
> import java.util.*;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.index.FilterIndexReader;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.HitCollector;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.search.TopDocCollector;
> 
> /** Simple searcher  */
> public class SimpleSearcher {
>     private static final String baseIndexPath = "/opt/lucene/index/" ;
>     private Map resultMap = new HashMap();
> 
>     public String searchIndex(String queryString, String coreId) throws Exception {
>         String result = "@#";
>         String trueIndexPath = baseIndexPath + "core" + coreId;
>         String searchField = "content";
>          IndexSearcher searcher = new IndexSearcher(trueIndexPath);
>         QueryParser queryParser = null;
>         try {
>             queryParser = new QueryParser(searchField, new StandardAnalyzer());
>         } catch (Exception ex) {
>              ex.printStackTrace();
>         }
> 
>         Query query = queryParser.parse(queryString);
> 
>         Hits hits = null;
>         try {
>              hits = searcher.search(query);
>         } catch (Exception ex) {
>              ex.printStackTrace();
>         }
> 
>         int hitCount = hits.length();
>         System.out.println("Results found :" + hitCount);
> 
>         for (int ix=0; (ix<hitCount && ix<10); ix++) {
>              Document doc = hits.doc(ix);
>             System.out.println(doc.get("id"));
>             System.out.println(doc.get("content"));
>             result = result + doc.get("id") + "," + doc.get("content");
>             resultMap.put(doc.get("id"), doc.get("content"));
>         }
>         JSONWriter writer = new JSONWriter();
>         return writer.write(resultMap);
>         //return result;
>     }
> 
>     public static void main(String args[]) throws Exception{
>          SimpleSearcher searcher = new SimpleSearcher();
>         String queryString = args[0];
>         System.out.println("Quering for :" + queryString);
>         searcher.searchIndex(queryString, "0");
>     }
> 
> }
> NB: Please ignore the improper naming conventions, indentation, etc.
> Can someone point out what's going wrong? One more thing: when I tried to
> view the indexed docs using Luke, I found that the doc content contains
> one regional char and then &#2367, like this, but when I clicked "show" for
> that page it showed me the true regional content without any "?" or the
> above &#... things. It seems the indexing is fine but I have to modify my
> searcher.

Is the parameter queryString created using the correct encoding (e.g. when
converting a string coming from the HTTP request)?
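
To illustrate what can go wrong here, the following sketch decodes a
percent-encoded query parameter with two different charsets. It assumes the
query arrives URL-encoded as UTF-8, which may not match your setup; the class
QueryDecoding and the method decodeQueryParam are made up for this example,
and only java.net.URLDecoder is a real JDK class.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class, not part of the original code.
public class QueryDecoding {

    // Decodes a percent-encoded parameter value with the named charset.
    static String decodeQueryParam(String raw, String charsetName)
            throws UnsupportedEncodingException {
        return URLDecoder.decode(raw, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // UTF-8 percent-encoding of the code points U+0915 U+093F.
        String raw = "%E0%A4%95%E0%A4%BF";

        // Correct charset: the six bytes become two characters.
        String good = decodeQueryParam(raw, "UTF-8");
        System.out.println(good.equals("\u0915\u093F")); // prints "true"
        System.out.println(good.length());               // prints "2"

        // Wrong charset: each byte becomes its own character, so the
        // parsed query can never match the indexed terms.
        String bad = decodeQueryParam(raw, "ISO-8859-1");
        System.out.println(bad.length());                // prints "6"
    }
}
```

A query string passed on the command line, as in your main method, is decoded
by the JVM using the platform settings, so it can be affected in a similar
way.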

> How do I do that, any hints? Thank you very much. One more thing:
> when searching through Luke I'm able to see many results, but through my
> SimpleSearcher class I'm not able to see all these results for the same
> query. What could be the reason?

Did you use the same analyzer in Luke as in your searcher? A different
analyzer tokenizes the query differently and can return a different set of
hits. If the query string is incorrectly encoded, see above!

> Thanks,
> KK.
> 
> 
> 
> On Thu, May 21, 2009 at 12:03 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> 
> > Indexed data comes out the same way it was put in. Lucene works with Java
> > Strings, so encoding is irrelevant to it. When you index your values, you
> > must be sure to construct your index strings/char arrays correctly using
> > the UTF-8 encoding (e.g. by using a standard Java Reader, new
> > String(byte[], charset), and so on). When you then print stored fields,
> > you must do the same in the other direction. So the general rule: always
> > specify the correct charset when converting between strings and bytes.
> > For searching: it also depends on the Analyzer used during indexing and
> > searching. Analyzers written for specific languages often cannot
> > correctly handle characters from foreign languages. But e.g.
> > StandardAnalyzer or WhitespaceAnalyzer do not modify the tokens in any
> > way (if making them lowercase is not a problem).
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -----Original Message-----
> > > From: KK [mailto:dioxide.softw...@gmail.com]
> > > Sent: Thursday, May 21, 2009 3:25 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Posting unicode data to lucene not working during
> > > searching/retreival!
> > >
> > > How do I post UTF-8 encoded Unicode data to a Lucene index? Do we have
> > > to specify something special, any sort of flag saying that we're
> > > posting Unicode data? I tried to post some UTF-8 encoded data, but
> > > during retrieval I'm not able to see that data; there are just "?"
> > > marks in all those places. Earlier I was using Solr, posting with the
> > > same method, and retrieval was working fine, but I don't know what the
> > > issue is with Lucene; maybe I'm missing something. Can someone tell me
> > > what could be the issue? Thank you.
> > >
> > > KK,
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >

