Hi Itamar,
I didn't include the m_GetFileContents function here because I have checked the
text after this function reads it, and the text is correct.
What do you mean by "Index UTF8 / Unicode encoded files instead of your ANSI
ones"? Is there some index configuration for doing this, or are you talking
about the files we are indexing? If it is about the files we are indexing,
nothing changes when the file is UTF8 / Unicode instead of ANSI.
Thanks & Regards,
Rui
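On the question of what differs between ANSI, UTF8 and Unicode files: editors
such as Windows Notepad typically mark UTF8 and Unicode (UTF-16) files with a
byte order mark at the start, while ANSI files have none. A minimal sketch of
telling them apart, assuming the raw file bytes are already in a std::string.
BomKind and detectBom are hypothetical names, not part of CLucene or MFC:

```cpp
#include <cstddef>
#include <string>

// Encodings that can be recognised from a leading byte order mark (BOM).
enum class BomKind { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file and report which BOM, if any, is present.
// A UTF-32 LE BOM (FF FE 00 00) would be reported as Utf16LE; that case is
// ignored in this sketch.
BomKind detectBom(const std::string& bytes) {
    // Read bytes as unsigned so values >= 0x80 compare as expected.
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return BomKind::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return BomKind::Utf16LE;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return BomKind::Utf16BE;
    return BomKind::None;  // no BOM: possibly ANSI, or UTF8 saved without one
}
```

A loader could branch on the result to decide how to decode the remaining
bytes before handing the text to the indexer.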
From: ita...@divrei-tora.com
To: clucene-developers@lists.sourceforge.net
Date: Mon, 26 Apr 2010 21:23:56 +0300
Subject: Re: [CLucene-dev] Clucene search - Do not found some words
CString is MFC's string class, and its character type is TCHAR.
Rui, the function we are actually interested in is m_GetFileContents. The error
most likely lies there, in the way you are loading your text documents (which
we already established are ANSI). Please also let us know how you compile your
app (MBCS or Unicode). In the meantime, try two more things:
1. Index UTF8 / Unicode encoded files instead of your ANSI ones.
2. Use SimpleAnalyzer instead of StandardAnalyzer. StandardAnalyzer is meant
   primarily for English texts, and might be incompatible with accented
   letters. See cl_test::TestAnalyzers.cpp (esp. testISOLatin1AccentFilter);
   try playing with it a bit to see whether the issue is with CLucene or with
   your own code.
HTH
Itamar.
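One way a loader for ANSI text can go wrong, offered only as a possibility
since m_GetFileContents was not posted: where plain char is signed, widening a
byte such as 0xE7 ("ç") directly to wchar_t sign-extends it, so it no longer
equals U+00E7. The sketch below widens through unsigned char instead. It
assumes the bytes are Latin-1 (Windows-1252 "ANSI" agrees with Latin-1 for the
range 0xA0-0xFF, which covers the Portuguese accented letters, but differs in
0x80-0x9F). latin1ToWide is a hypothetical helper, not a CLucene or MFC
function:

```cpp
#include <string>

// Widen single-byte Latin-1 text to wchar_t code units.
// Each byte goes through unsigned char first, so bytes >= 0x80
// (for example 0xE7, "ç") are not sign-extended when char is signed.
std::wstring latin1ToWide(const std::string& bytes) {
    std::wstring wide;
    wide.reserve(bytes.size());
    for (char c : bytes) {
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return wide;
}
```

With this helper, the byte string "or\xE7" "amento" becomes a nine-character
wide string whose third code unit is U+00E7.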
From: Onilton Maciel [mailto:oniltonmac...@gmail.com]
Sent: Monday, April 26, 2010 5:20 PM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Clucene search - Do not found some words
Shouldn't ls_text be TCHAR?
(I'm asking other people reading this thread)
On Mon, Apr 26, 2010 at 9:58 AM, Rui Oliveira <ruifra...@hotmail.com> wrote:
void c_IndexEx::m_Add(CString avs_codRevsId)
{
    CString ls_origem = "c_IndexEx::m_Add";
    try
    {
        m_InitVariables();
        if(!ii_enmIndx)
            return;
        IndexWriter* writer = NULL;
        lucene::analysis::standard::StandardAnalyzer an;
        // open the existing index (clearing a stale lock), or create a new one
        if ( IndexReader::indexExists(iclp_indexPath) ){
            if ( IndexReader::isLocked(iclp_indexPath) )
            {
                m_AppendLog("Index was locked... unlocking it.");
                IndexReader::unlock(iclp_indexPath);
            }
            writer = _CLNEW IndexWriter( iclp_indexPath, &an, false);
        }
        else
        {
            writer = _CLNEW IndexWriter( iclp_indexPath ,&an, true);
        }
        writer->setMaxFieldLength(IndexWriter::DEFAULT_MAX_FIELD_LENGTH);
        writer->setUseCompoundFile(true);
        uint64_t str = lucene::util::Misc::currentTimeMillis();
        // make a new, empty document
        Document* lcl_doc = _CLNEW Document();
        if(m_FileDocument( avs_codRevsId, lcl_doc ))
        {
            writer->addDocument( lcl_doc );
        }
        _CLDELETE(lcl_doc);
        writer->optimize();
        writer->close();
        _CLDELETE(writer);
    }
    catch(CLuceneError& err)
    {
        // CLucene error: swallowed without logging
        return;
    }
    catch( CException* e )
    {
        // MFC exceptions are caught by pointer and must be released
        e->Delete();
        m_AppendLog(ls_origem);
        return;
    }
    catch(...)
    {
        return;
    }
}
BOOL c_IndexEx::m_FileDocument(CString avs_codRevsId, Document* arcl_doc)
{
    // fill the document passed in by the caller
    CString ls_codDocmId;
    CString ls_Path = m_GetFilePath(avs_codRevsId, &ls_codDocmId);
    if(ls_Path.IsEmpty())
    {
        return FALSE;
    }
    // copy the path into a writable buffer; _tcscpy into a char buffer only
    // compiles when TCHAR is char (MBCS build)
    char* lcl_Path = NULL;
    lcl_Path = new char[ls_Path.GetLength()+1];
    _tcscpy(lcl_Path, ls_Path);
    CString ls_text;
    m_GetFileContents(lcl_Path, &ls_text);
    arcl_doc->add( *_CLNEW Field(_T("contents"), ls_text, Field::STORE_YES |
        Field::INDEX_TOKENIZED) );
    icl_file.m_DeleteFile(ls_Path);
    // release the path buffer (allocated with new[], so delete[])
    delete[] lcl_Path;
    return TRUE;
}
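A side note on the path buffer in m_FileDocument: memory obtained with
new char[n] has to be released with delete[], and a standard container avoids
the manual pairing altogether. A minimal sketch using std::string in place of
CString so that it stands alone; toWritableBuffer is a hypothetical helper
name:

```cpp
#include <string>
#include <vector>

// Copy a path into a writable, NUL-terminated buffer that frees itself
// when it goes out of scope, so no new[]/delete[] pairing is needed.
std::vector<char> toWritableBuffer(const std::string& path) {
    std::vector<char> buf(path.begin(), path.end());
    buf.push_back('\0');  // terminator expected by C-style APIs
    return buf;
}
```

The result's data() pointer can be passed wherever a char* is expected, for as
long as the vector is alive.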
From: oniltonmac...@gmail.com
Date: Mon, 26 Apr 2010 10:36:45 -0300
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Clucene search - Do not found some words
Can you send the code where you index?
On Mon, Apr 26, 2010 at 9:55 AM, Rui Oliveira <ruifra...@hotmail.com> wrote:
How can I check this?
I just read the text from the files into a CString and then pass it to CLucene.
The text I get from the file into the CString appears to be right; I have
checked it in debug mode and it looks good.
Rui
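One way to check the encoding beyond looking at the string in the debugger
(which may render the text through its own code page and so look right even
when the stored values are not) is to print the numeric value of every code
unit just before it is handed to the indexer. A minimal sketch, assuming the
text is available as a std::wstring; dumpCodeUnits is a hypothetical helper
name:

```cpp
#include <cstdio>
#include <string>

// Render every code unit of a wide string as "U+XXXX", separated by spaces,
// so the actual stored values can be compared with the expected code points.
std::string dumpCodeUnits(const std::wstring& text) {
    std::string out;
    char buf[32];
    for (wchar_t wc : text) {
        // Mask to 32 bits so a sign-extended value stays readable.
        unsigned long value = static_cast<unsigned long>(wc) & 0xFFFFFFFFul;
        std::snprintf(buf, sizeof buf, "U+%04lX ", value);
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}
```

For a correctly decoded "orçamento" the third entry should read U+00E7; any
other value at that position points to a decoding problem before the text
reaches CLucene.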
> Date: Mon, 26 Apr 2010 14:44:56 +0200
> From: nuncupa...@googlemail.com
> To: clucene-developers@lists.sourceforge.net
> Subject: Re: [CLucene-dev] Clucene search - Do not found some words
>
> Rui,
>
> which encoding do you use internally before you give it to CLucene?
> Maybe you use an encoding different to the encoding expected by
> CLucene.
>
> Kind regards,
>
> Veit
>
> 2010/4/26 Rui Oliveira <ruifra...@hotmail.com>:
> > Hi,
> >
> > I have been using luke to analyze the index.
> >
> > Well, all Portuguese characters appear replaced by a strange character.
> >
> > What can I do to avoid this?
> > Is it not possible to make CLucene work with Portuguese characters?
> >
> > Thanks & Regards,
> > Rui
> >
> >
> >
> >> Date: Fri, 23 Apr 2010 20:43:49 +0200
> >> From: bvanklin...@gmail.com
> >> To: clucene-developers@lists.sourceforge.net
> >> Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >>
> >> I suggest using a program called luke (google it). You can then look
> >> into the index and see what is indexed. Let us know if u see all the
> >> words you would expect to see. And see if u can find the document if u
> >> search from luke
> >>
> >> handy program :)
> >>
> >> cheers
> >> ben
> >>
> >> On Friday, April 23, 2010, Rui Oliveira <ruifra...@hotmail.com> wrote:
> >> >
> >> > Itamar,
> >> >
> >> > All the test results come from the same file. The same file contains
> >> > "orçamento" and "administração"; the search finds "administração" but
> >> > does not find "orçamento".
> >> >
> >> > The results are the same whether the file is ANSI, Unicode or UTF8 encoded.
> >> > The problem is not in loading the files, because I debugged the text loaded
> >> > from the file and it is OK.
> >> >
> >> > Rui
> >> >
> >> >
> >> >
> >> >
> >> > From: ita...@divrei-tora.com
> >> > To: clucene-developers@lists.sourceforge.net
> >> > Date: Fri, 23 Apr 2010 17:59:27 +0300
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > Rui,
> >> >
> >> > This file is ANSI encoded. Are the other files, the ones you do succeed in
> >> > finding, perhaps Unicode / UTF8 encoded? If that's the case, your routine for
> >> > loading the files is buggy. You should either have them all encoded using
> >> > the same encoding, or have more intelligent code to convert incompatible
> >> > encodings.
> >> >
> >> > HTH
> >> >
> >> > Itamar.
> >> >
> >> >
> >> > From: Rui Oliveira [mailto:ruifra...@hotmail.com]
> >> > Sent: Friday, April 23, 2010 4:32 PM
> >> > To: clucene-developers; oniltonmac...@gmail.com
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> >
> >> > I have just attached the file.
> >> >
> >> > Tks, Rui
> >> >
> >> >
> >> > From: oniltonmac...@gmail.com
> >> > Date: Fri, 23 Apr 2010 09:22:05 -0400
> >> > To: clucene-developers@lists.sourceforge.net
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > Can you send me the file that has both "orçamento" and "administração"?
> >> >
> >> > Or you can do a test: open the file and delete the ç from "orçamento" and
> >> > "administração", then type the ç again.
> >> >
> >> > Index again and try to search for both words again.
> >> >
> >> > On Fri, Apr 23, 2010 at 9:14 AM, Rui Oliveira <ruifra...@hotmail.com>
> >> > wrote:
> >> >
> >> > They are text files (*.txt) and both words are in the same document.
> >> > When I search for "orçamento" nothing is found, and when I search for
> >> > "administração" the document is found.
> >> >
> >> >
> >> > Rui
> >> >
> >> >
> >> > From: oniltonmac...@gmail.com
> >> > Date: Fri, 23 Apr 2010 09:09:30 -0400
> >> > To: clucene-developers@lists.sourceforge.net
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > Seems like an encoding problem with these documents. Are they html pages?
> >> > Are the words "orçamento" and "administração" in the same page, for example?
> >> >
> >> > Can you dump one of these files here? (One that has the problem and one
> >> > that does not.)
> >> >
> >> >
> >> > On Fri, Apr 23, 2010 at 9:05 AM, Rui Oliveira <ruifra...@hotmail.com>
> >> > wrote:
> >> >
> >> > I am indexing several separate documents.
> >> >
> >> > The document that has these words is a small text document. It is indexed
> >> > without any visible error, and it is found when I search for other words
> >> > in it.
> >> >
> >> >
> >> > Rui
> >> >
> >> >
> >> > From: oniltonmac...@gmail.com
> >> > Date: Fri, 23 Apr 2010 08:58:05 -0400
> >> > To: clucene-developers@lists.sourceforge.net
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > What are you indexing?
> >> >
> >> > Just one big document?
> >> > Or a lot of separate documents (html documents?)
> >> >
> >> > On Fri, Apr 23, 2010 at 8:54 AM, Rui Oliveira <ruifra...@hotmail.com>
> >> > wrote:
> >> >
> >> > Hi Onilton,
> >> >
> >> > I have tested with "orcamento" instead of "orçamento" and didn't get
> >> > anything.
> >> >
> >> > I do not know whether lucene indexes "orçamento" in a wrong way: it indexes
> >> > without any error, but when I search for the word I do not get anything.
> >> >
> >> > Thanks & Regards,
> >> > Rui
> >> >
> >> >
> >> > From:
> >> >
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> _______________________________________________
> >> CLucene-developers mailing list
> >> CLucene-developers@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/clucene-developers