Hallo KK., > Thanks for your quick response. Let me explain the whole thing. > I'm downloading the pages for give urls and then extracting text and > converting that to unicode utf-8 this way, > > byte [] utfEncodeByteArray = textOnly.getBytes(); > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF- > 8")); > > here textonly is the text extracted from the downloaded page, and this is > the way i'm donwloading the pages, > private String downloadPage(URL pageUrl) { > try { > // Open connection to URL for reading. > BufferedReader reader = > new BufferedReader(new InputStreamReader( > pageUrl.openStream())); > > // Read page into buffer. > String line; > StringBuffer pageBuffer = new StringBuffer(); > while ((line = reader.readLine()) != null) { > pageBuffer.append(line); > } > > return pageBuffer.toString(); > } catch (Exception e) { > } > > return null; > } > > I'm I going wrong anywhere, do I've to specify the charset when opening > hte > bufferedReader?
You have to specify the charset when converting the InputStream to a Reader, so specify the charset in the InputStreamReader ctor [new InputStreamReader(InputStream,charset)]! If you not do this, the ctor would use the default charset of your platform, which may be not UTF-8! ... > and for searcher this is the code: > package solrSearch; > > import java.io.FileReader; > import org.stringtree.json.JSONWriter; > import java.util.*; > > import org.apache.lucene.analysis.Analyzer; > import org.apache.lucene.analysis.standard.StandardAnalyzer; > import org.apache.lucene.document.Document; > import org.apache.lucene.index.FilterIndexReader; > import org.apache.lucene.index.IndexReader; > import org.apache.lucene.queryParser.QueryParser; > import org.apache.lucene.search.HitCollector; > import org.apache.lucene.search.Hits; > import org.apache.lucene.search.IndexSearcher; > import org.apache.lucene.search.Query; > import org.apache.lucene.search.ScoreDoc; > import org.apache.lucene.search.Searcher; > import org.apache.lucene.search.TopDocCollector; > > /** Simple searcher */ > public class SimpleSearcher { > private static final String baseIndexPath = "/opt/lucene/index/" ; > private Map resultMap = new HashMap(); > > public String searchIndex(String queryString, String coreId) throws > Exception{ > String result = "@#"; > String trueIndexPath = baseIndexPath + "core" + coreId; > String searchField = "content"; > IndexSearcher searcher = new IndexSearcher(trueIndexPath); > QueryParser queryParser = null; > try { > queryParser = new QueryParser(searchField, new > StandardAnalyzer()); > } catch (Exception ex) { > ex.printStackTrace(); > } > > Query query = queryParser.parse(queryString); > > Hits hits = null; > try { > hits = searcher.search(query); > } catch (Exception ex) { > ex.printStackTrace(); > } > > int hitCount = hits.length(); > System.out.println("Results found :" + hitCount); > > for (int ix=0; (ix<hitCount && ix<10); ix++) { > Document doc = hits.doc(ix); > System.out.println(doc.get("id")); > System.out.println(doc.get("content")); > result = result + doc.get("id") + "," + doc.get("content"); > resultMap.put(doc.get("id"), doc.get("content")); > } > JSONWriter writer = new JSONWriter(); > return writer.write(resultMap); > //return result; > } > > public static void main(String args[]) throws Exception{ > SimpleSearcher searcher = new SimpleSearcher(); > String queryString = args[0]; > System.out.println("Quering for :" + queryString); > searcher.searchIndex(queryString, "0"); > } > > } > NB: Please ignore improper naming conventions. indentations etc. > Can some one point me whats going wrong. And one more thing when I tried > to > see the indexed docs using the LUKE, I found that the doc content contains > one regional char and then ि like this but when I clicked "show " for > that page it showed me the true regional content wihtout any of "?" or the > above &#... things. It seems the indexing is fine but I've to modify my > searcher . Is the parameter queryString created using the correct encoding (e.g. when converting a string coming from the HTTP request). > How to do that, any hints? Thank you very much. One more thing > when searching throuh luke I'm able to see many results but through my > SimpleSearcher class I'm not able to see all these results for the same > query. What could be the reason? Did you use the same analyzer in Luke when searching? If the query string is incorrectly encoded, see above! > Thanks, > KK. > > > > On Thu, May T21, 2009 at 12:03 PM, Uwe Schindler <u...@thetaphi.de> wrote: > > > Indexed data is coming out in the same way as put in. Lucene works with > > Java > > Strings, so encoding is irrelevant. When you index your values, you must > be > > sure, to construct your index string/char arrays correctly using the > UTF-8 > > encoding (e.g. by using a standard Java Reader, new String byte[], > charset) > > and so on. When you then print stored fields you must do the same in the > > other direction. So the general rule: Always specify the correct charset > > when converting to/from strings to bytes. > > For searching: It roughly also depends also on the Analyzer used during > > indexing and searching. Often analyzers written for specific languages > > cannot correctly handle characters from foreign languages. But e.g. > > StandardAnalyzer or WhitespaceAnalyzer does not modify the tokens in any > > way > > (if making them lowercase is not a problem). > > > > ----- > > Uwe Schindler > > H.-H.-Meier-Allee 63, D-28213 Bremen > > http://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > -----Original Message----- > > > From: KK [mailto:dioxide.softw...@gmail.com] > > > Sent: Thursday, May 21, 2009 3:25 PM > > > To: java-user@lucene.apache.org > > > Subject: Posting unicode data to lucene not working during > > > searching/retreival! > > > > > > How to post utf-8 unicoded data to lucene index. Do we have to specify > > > something special, any sort of flag saying that we're posting unicoded > > > data? > > > I tried to post some utf-8 encoded data, during retrieval I'm not able > to > > > see those data , there are just "?" marks in all those places. Earlier > I > > > was > > > using Solr and I was posting using the same method and retreival was > also > > > working fine, but I dont' know what is the issue with lucene, may be > I'm > > > missing something. Can someone tell me what could be the issue? Thank > > you. > > > > > > KK, > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org