Re: SpanXXQuery Usage

2004-03-22 Thread Terry Steichen
Otis,

Can you give me/us a rough idea of what these are supposed to do?  It's hard
to extrapolate the terse unit test code into much of a general notion.  I
searched the archives with little success.

Regards,

Terry

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 22, 2004 2:46 AM
Subject: Re: SpanXXQuery Usage


 Only in unit tests, so far.
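 Roughly: they all match on term positions (spans) rather than just on
 which documents contain the terms.  Pieced together from the unit tests
 (the classes live in org.apache.lucene.search.spans; treat this as a
 sketch of the API, not gospel):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.*;

SpanTermQuery quick = new SpanTermQuery(new Term("body", "quick"));
SpanTermQuery fox   = new SpanTermQuery(new Term("body", "fox"));

// SpanNearQuery: "quick" and "fox" within 2 positions of each other, in order
SpanQuery near = new SpanNearQuery(new SpanQuery[] { quick, fox }, 2, true);

// SpanFirstQuery: "quick" somewhere within the first 5 positions of the field
SpanQuery first = new SpanFirstQuery(quick, 5);

// SpanOrQuery: union of the spans matched by its clauses
SpanQuery either = new SpanOrQuery(new SpanQuery[] { near, first });

// SpanNotQuery: spans of the first clause that do not overlap the second
SpanQuery not = new SpanNotQuery(near, fox);
```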

 Otis

 --- Terry Steichen [EMAIL PROTECTED] wrote:
  Is there any documentation (other than that in the source) on how to
  use the new SpanxxQuery features?  Specifically: SpanNearQuery,
  SpanNotQuery, SpanFirstQuery and SpanOrQuery?
 
  Regards,
 
  Terry
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]







Indexing japanese PDF documents

2004-03-22 Thread Chandan Tamrakar
I am using the latest PDFBox library for parsing. I can parse English
documents successfully, but when I parse a document containing both English
and Japanese, I do not get the results I expect.

Has anyone tried using PDFBox to parse Japanese documents? Or do I need to
use another parser, such as xpdf or JPedal?

Thanks in advance,
Chandan






Re: CJK Analyzer indexing japanese word document

2004-03-22 Thread Chandan Tamrakar
Hi Scott,

Thanks for your advice. I am now using POI to convert Word documents, and I
made sure to convert the text to Unicode before passing it to Lucene for
indexing; it is working perfectly fine. Which parser is best for parsing PDF
documents? I tried PDFBox, but it doesn't seem to work well with Japanese
characters. Any suggestions?

Thanks
- Original Message -
From: Scott Smith [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 17, 2004 4:27 AM
Subject: RE: CJK Analyzer indexing japanese word document


 I have used this analyzer with Japanese and it works fine.  In fact, I'm
 currently doing English, several western European languages, traditional
 and simplified Chinese and Japanese.  I throw them all in the same index
 and have had no problem other than my users wanted the search limited by
 language.  I solved that problem by simply adding a keyword field to the
 Document which has the 2-letter language code.  I then automatically add
 the term indicating the language as an additional constraint when the
 user specifies the search.
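
 Concretely, the two halves look something like this (field names and the
 bodyText/userQuery variables are just my own placeholders):

```java
// Index time: tag each document with its 2-letter language code
Document doc = new Document();
doc.add(Field.Keyword("lang", "ja"));
doc.add(Field.Text("contents", bodyText));

// Search time: AND a language term onto whatever the user's query was
BooleanQuery constrained = new BooleanQuery();
constrained.add(userQuery, true, false);                             // required
constrained.add(new TermQuery(new Term("lang", "ja")), true, false); // required
```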

 You do need to be sure that the Shift-JIS gets converted to unicode
 before you put it in the Document (and pass it to the analyzer).
 Internally, I believe lucene wants everything in unicode (as any good
 java program would). Originally, I had problems with Asian languages and
 eventually determined my xml parser wasn't translating my Shift-JIS,
 Big5, etc. to unicode.  Once I fixed that, life was good.

 -Original Message-
 From: Che Dong [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, March 16, 2004 8:31 AM
 To: Lucene Users List
 Subject: Re: CJK Analyzer indexing japanese word document

 Some Korean friends tell me they use it successfully for Korean, so I
 think it also works for Japanese. Mostly the problem is locale settings.

 Please check weblucene project for xml indexing samples:
 http://sourceforge.net/projects/weblucene/

 Che Dong
 - Original Message -
 From: Chandan Tamrakar [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Tuesday, March 16, 2004 4:31 PM
 Subject: CJK Analyzer indexing japanese word document


 
  I am using the CJKAnalyzer from the Apache sandbox. I have set the Java
  file.encoding setting to SJIS, and I am able to index and search Japanese
  HTML pages. I can see the index dumps as I expected. However, when I index
  a Word document containing Japanese characters, it is not indexed as
  expected. Do I need to change anything in the CJKTokenizer and CJKAnalyzer
  classes? I have been able to index a Word document with StandardAnalyzer.

  thanks in advance
  chandan
 
 
 
 
 










Re: Final Hits

2004-03-22 Thread Erik Hatcher
How exactly would you take advantage of a subclassable Hits class?

On Mar 21, 2004, at 6:01 AM, Terry Steichen wrote:

Does anyone know why the Hits class is final (thus preventing it from 
being subclassed)?

Regards,

Terry




Re: Indexing japanese PDF documents

2004-03-22 Thread Otis Gospodnetic
I have not tried these other tools yet.
Have you asked Ben Litchfield, the PDFBox author, about handling of
Japanese text?

Otis

--- Chandan Tamrakar [EMAIL PROTECTED] wrote:
 I am using the latest PDFBox library for parsing. I can parse English
 documents successfully, but when I parse a document containing both
 English and Japanese, I do not get the results I expect.
 
 Has anyone tried using PDFBox to parse Japanese documents? Or do I need
 to use another parser, such as xpdf or JPedal?
 
 Thanks in advance,
 Chandan
 
 
 
 
 





Re: Indexing japanese PDF documents

2004-03-22 Thread Ben Litchfield

Yes, he did, but I was away the past couple of days.  As this is more of a
PDFBox issue, I responded in the PDFBox forums; please follow the thread
there if you are interested.

Ben



On Mon, 22 Mar 2004, Otis Gospodnetic wrote:

 I have not tried these other tools yet.
 Have you asked Ben Litchfield, the PDFBox author, about handling of
 Japanese text?

 Otis

 --- Chandan Tamrakar [EMAIL PROTECTED] wrote:
  I am using the latest PDFBox library for parsing. I can parse English
  documents successfully, but when I parse a document containing both
  English and Japanese, I do not get the results I expect.
 
  Has anyone tried using PDFBox to parse Japanese documents? Or do I need
  to use another parser, such as xpdf or JPedal?
 
  Thanks in advance,
  Chandan
 
 
 
 
 





Re: Demoting results

2004-03-22 Thread Boris Goldowsky
On Fri, 2004-03-19 at 11:58, Doug Cutting wrote:
 Doug Cutting wrote:
  On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
 
  Have you tried assigning these very small boosts (0 < boost < 1) and 
  assigning other query clauses relatively large boosts (boost > 1)?
  
  I don't think you understood my proposal.  You should try boosting the 
  documents when you add them.  Instead of adding a doctype field with 
  "good" and "bad" values, use Document.setBoost(0.01) at index time.
 
 Sorry.  My mistake.  You did understand my proposal, it was just a bad 
 proposal.  Boosting documents is a better approach, but is less 
 flexible.  I think the final proposal in my previous message might be 
 the best approach (defining a custom coordination function for these 
 query clauses).

Thanks for the ideas - I love the flexibility of Lucene that there are
so many ways to accomplish what at first seemed so difficult.
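
For the record, the index-time boost Doug describes comes down to the
following (text, isLowQuality and writer here are my own stand-ins):

```java
Document doc = new Document();
doc.add(Field.Text("contents", text));
if (isLowQuality) {
    doc.setBoost(0.01f);  // scale this document's scores way down
}
writer.addDocument(doc);
```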

Boris






Re: Specifation of the Key words to be searched

2004-03-22 Thread Otis Gospodnetic
Re-directing to lucene-user list.

One way of doing this is by writing a custom Analyzer that throws away
words you don't want to index (see an example of custom Analyzer in
jGuru FAQ).  Another way would be to just re-use the existing Analyzers
and add words you don't want indexed to the Analyzer's stop list.
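
For the second approach, something along these lines should work (the extra
stop words here are made-up examples, and note that passing a stop list to
the constructor replaces the default list, which is why this concatenates
the two):

```java
// Extend StandardAnalyzer's default stop list with your own entries
String[] extraStops = { "foo", "bar" };
String[] stopWords =
    new String[StandardAnalyzer.STOP_WORDS.length + extraStops.length];
System.arraycopy(StandardAnalyzer.STOP_WORDS, 0,
                 stopWords, 0, StandardAnalyzer.STOP_WORDS.length);
System.arraycopy(extraStops, 0,
                 stopWords, StandardAnalyzer.STOP_WORDS.length, extraStops.length);

Analyzer analyzer = new StandardAnalyzer(stopWords);
IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
```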

Otis


--- jitender ahuja [EMAIL PROTECTED] wrote:
 Sir,
    I am implementing Lucene for a database as part of my master's
 project. I would like to reduce the index directory size by specifying
 the keywords to be indexed for a Text field supplied as a Reader.
 Specifying the keywords, if possible, would further reduce the index
 directory size, but I am unable to figure out how to do this.
 Kindly let me know how to achieve this.
 
 
 Regards,
 Jitender





Re: Final Hits

2004-03-22 Thread Terry Steichen
Erik,

There are a number of different possibilities which I'm still evaluating.
But if there is some significant reason for *not* subclassing Hits
(performance?), that will have a major bearing on whether the approach I'm
evaluating makes sense.

So, let me rephrase my question: Is the final nature of Hits due to some
performance reason, or simply because no one has previously expressed any
interest in subclassing it?  Or, putting it in reverse, is there any
technical problem likely to arise from removing the final attribute(s)?

Regards,

Terry

- Original Message -
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 22, 2004 7:06 AM
Subject: Re: Final Hits


 How exactly would you take advantage of a subclassable Hits class?


 On Mar 21, 2004, at 6:01 AM, Terry Steichen wrote:

  Does anyone know why the Hits class is final (thus preventing it from
  being subclassed)?
 
  Regards,
 
  Terry









Re: Final Hits

2004-03-22 Thread Erik Hatcher
Terry,

I'm still quite curious how you plan to take advantage of a 
subclassable Hits.  Are you going to create your own IndexSearcher which 
returns your subclass somehow?

You could use a HitCollector (which is what is used under the covers of 
the Hits-returning methods anyway) to emulate whatever it is you're 
trying to do, I suspect.
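
Something like this (MyResult and results are placeholders for whatever 
your Hits subclass would have carried):

```java
final List results = new ArrayList();
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        // called for every matching document, in doc-id order
        results.add(new MyResult(doc, score));
    }
});
```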

As for 'final': Doug did a great thing by designing Lucene tight and 
controlled, with private/package-scoped access and final modifiers in 
lots of places.  There is no technical issue with removing the final, 
but we would need to see a pretty compelling, detailed reason to do so.

	Erik

On Mar 22, 2004, at 7:56 AM, Terry Steichen wrote:

Erik,

There are a number of different possibilities which I'm still evaluating.
But if there is some significant reason for *not* subclassing Hits
(performance?), that will have a major bearing on whether the approach
I'm evaluating makes sense.

So, let me rephrase my question: Is the final nature of Hits due to some
performance reason, or simply because no one has previously expressed any
interest in subclassing it?  Or, putting it in reverse, is there any
technical problem likely to arise from removing the final attribute(s)?

Regards,

Terry

- Original Message -
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 22, 2004 7:06 AM
Subject: Re: Final Hits

How exactly would you take advantage of a subclassable Hits class?

On Mar 21, 2004, at 6:01 AM, Terry Steichen wrote:

Does anyone know why the Hits class is final (thus preventing it from
being subclassed)?
Regards,

Terry









code works with 1.3-rc1 but not with 1.3-final??

2004-03-22 Thread Dan
I have some code that creates a Lucene index. It has been working fine 
with lucene-1.3-rc1.jar, but I wanted to upgrade to lucene-1.3-final.jar. 
I did this and the indexer breaks. I get the following error when 
running the indexer with 1.3-final:

Optimizing the index
IOException: /home/danl001/index-Mar-22-14_31_30/_ni.f43 (Too many open 
files)
Indexed 884 files in 8 directories
Index creation took 242 seconds
%

So it appears that the code using 1.3-final breaks on the call to 
optimize(). Does anyone know what is wrong?

Again, the ONLY change between the working version and the version that 
breaks on optimize is the jar file I use. lucene-1.3-rc1.jar works; 
lucene-1.3-final.jar doesn't. Weird, huh?

I've tested this on both Unix (solaris) and on windows. In both cases, 
I'm using jdk 1.4.2_03.



termPosition does not iterate properly in Lucene 1.3 rc1

2004-03-22 Thread Allen Atamer
Lucene does not iterate through the termPositions on one of my indexed data
sources. It used to iterate properly through this data source, but not
anymore. I tried a different indexed data source and it iterates
properly. The Lucene index directory does not have any lock files either.

My code is as follows:

TermPositions termPos = reader.termPositions(aTerm);
while (termPos.next()) {
    // look up the stored key field of the matching document
    String docID = reader.document(termPos.doc()).get(keyName);
    ...
}

Is there anything wrong with that? Thanks for your help,

Allen





Re: code works with 1.3-rc1 but not with 1.3-final??

2004-03-22 Thread Kevin A. Burton
Dan wrote:

I have some code that creates a lucene index. It has been working fine 
with lucene-1.3-rc1.jar but I wanted to upgrade to 
lucene-1.3-final.jar. I did this and the indexer breaks. I get the 
following error when running the index with 1.3-final:

Optimizing the index
IOException: /home/danl001/index-Mar-22-14_31_30/_ni.f43 (Too many 
open files)
Indexed 884 files in 8 directories
Index creation took 242 seconds
%

No... it's you... ;)

Read the FAQ and then run

ulimit -n 100 or so...

You need to increase your file handles.  Chances are you never noticed 
this before, but the problem was still present.  If you're on a Linux box 
you would be amazed to find out that you're only about 200 file handles 
away from running out of your per-user file-handle quota.

You might have to su to root to change this.. RedHat is more strict 
because it uses the glibc resource-restriction mechanism (whose name 
slips my mind at the moment). 

Debian is configured better here by default.

Also, a Google query would have solved this for you very quickly ;)..
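
For the record, checking and bumping the limit is just:

```shell
# see the current per-process open-file limit
ulimit -n

# raise it for this shell (the hard limit may require root, or an entry
# in /etc/security/limits.conf)
ulimit -n 4096
```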

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: code works with 1.3-rc1 but not with 1.3-final??

2004-03-22 Thread Matt Quail
Or use IndexWriter.setUseCompoundFile(true) to reduce the number of files
created by Lucene:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)
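
i.e. something like (the index path and analyzer are just placeholders):

```java
IndexWriter writer =
    new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
writer.setUseCompoundFile(true); // pack each segment into one compound file
// ... addDocument() calls ...
writer.optimize();
writer.close();
```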

=Matt

Kevin A. Burton wrote:

Dan wrote:

I have some code that creates a lucene index. It has been working fine 
with lucene-1.3-rc1.jar but I wanted to upgrade to 
lucene-1.3-final.jar. I did this and the indexer breaks. I get the 
following error when running the index with 1.3-final:

Optimizing the index
IOException: /home/danl001/index-Mar-22-14_31_30/_ni.f43 (Too many 
open files)
Indexed 884 files in 8 directories
Index creation took 242 seconds
%

No... it's you... ;)

Read the FAQ and then run

ulimit -n 100 or so...

You need to increase your file handles.  Chances are you never noticed 
this before, but the problem was still present.  If you're on a Linux box 
you would be amazed to find out that you're only about 200 file handles 
away from running out of your per-user file-handle quota.

You might have to su to root to change this.. RedHat is more strict 
because it uses the glibc resource-restriction mechanism (whose name 
slips my mind at the moment).
Debian is configured better here by default.

Also a google query would have solved this for you very quickly ;)..

Kevin







Lock timeout should show the index it failed on...

2004-03-22 Thread Kevin A. Burton
Just an RFE... if a lock times out, we should probably include the name of 
the FSDirectory (or a note that it's a RAMDirectory) in the exception...

I'm lazy, so this is a reminder either for me to do this myself or to wait 
until one of you guys takes care of it :)

Kevin



