Re: Help in Arabic Analysers with Lucene on Windows

Grant Ingersoll Mon, 29 Dec 2008 07:50:36 -0800


On Dec 29, 2008, at 9:59 AM, Girish Naik wrote:

FIELD_BODY is defined as
public static final String FIELD_BODY = "AVS_FIELD_BODY";
and its indexed as
 ParsedDoc webdoc = ParsedDoc.getDoc(page);
...

document.add(new Field(Constants.FIELD_BODY, webdoc.getContents(),Field.Store.NO, Field.Index.ANALYZED));


-x-x-x-
public static ParsedDoc getDoc(Page page) {
                try {
                        String raw = page.getContentType();
                        int semicolon = raw.indexOf(";");
                        if (semicolon > -1) {
                                raw = raw.substring(0, semicolon);      
                        }
                        MimeType mime = new MimeType(raw);
                        String contentType = mime.getBaseType();
                        String classname = getImpl(contentType);
                        if (classname == null) {
                                classname = getImpl(mime.getPrimaryType());
                                if (classname == null) {
                                        return null;    
                                }
                        }
                        Class webdocClass = Class.forName(classname);
                        ParsedDoc webdoc = (ParsedDoc) 
webdocClass.newInstance();
                        webdoc.page = page;
                        webdoc.contentType = contentType;
                        return webdoc;
                } catch (Exception e) {
                        _log.error("Eror while parsing file: " + page.toURL());
                        throw new SysException(e.getMessage());
                }
-x-x-x-

And Luke is not able to open the Indexed files by Lucene currently.But on my colleague's System it opened but no arabic content wasfound instead some chanrecters like Ø§Ù„Ø§Ø³ØªØ±Ø§ØªÙŠØ¬ .. etcwere found.

I am now pretty sure it is an encoding issue. I'm guessing thathowever you are getting the page, it is not in the right encoding.How do you obtain the Page object? I'm guessing you are crawling.You need to make sure you are getting the encoding of the file andopening it with that encoding.


Something like:

Reader reader = new InputStreamReader(new FileInputStream(file),encoding);


where encoding is the encoding of the file.

In my local now its giving 'Unknown format version: -8' as it wasgiving when my colleague tried to open and index from a Linux systemwhere search was working fine.

What version of Lucene are you using and what version of Luke? Thisusually happens, I believe, when the Luke version is older than theLucene version used to create the index.










---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Help in Arabic Analysers with Lucene on Windows

Reply via email to