calculate wi = tfi * IDFi for each document.

2005-06-02 Thread Andrew Boyd
If I have search results, how can I calculate, using Lucene's API, wi = tfi *
IDFi for each document?

wi   = term weight
tfi  = term frequency of term i in a document
IDFi = inverse document frequency = log(D/dfi)
dfi  = document frequency, i.e. the number of documents containing term i
D    = number of documents in my search result
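
For concreteness, here's a minimal sketch of the kind of thing I mean,
assuming the field was indexed with term vectors (the index path, field
name, term, and doc id are all made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;

public class TfIdfSketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        int docId = 0;                // some hit's document number
        int D = reader.numDocs();     // whole index; use hits.length()
                                      // if D is the result-set size
        int dfi = reader.docFreq(new Term("contents", "lucene"));
        double idf = Math.log((double) D / dfi);

        // per-document term frequency (requires term vectors at index time)
        TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        double tf = 0;
        for (int i = 0; i < terms.length; i++)
            if (terms[i].equals("lucene")) tf = freqs[i];

        System.out.println("wi = " + tf * idf);
        reader.close();
    }
}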

Thanks,

Andrew




Lucene and Documentum

2005-06-02 Thread Andrew Boyd
Hi All,
  Has anyone had any experience using Lucene to search a Documentum repository?

Thanks

Andrew




RE: how long should optimizing take

2005-06-02 Thread Angelov, Rossen
Thanks for the suggestion; Jian Chen's idea is very similar. Optimizing that
often is probably not necessary, and not that critical for speeding up the
searches.

I'll try changing the index process not to optimize at all, and execute the
optimization independently of the indexing on a weekly basis.

Ross

-Original Message-
From: Dan Armbrust [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 11:10 AM
To: java-user@lucene.apache.org
Subject: Re: how long should optimizing take


I would run your optimize process in a separate thread, so that your web 
client doesn't have to wait for it to return.

You may even want to set the optimize part up to run on a weekly 
schedule, at a low load time.  I probably wouldn't reoptimize after 
every 30 documents, on an index that size.

Optimizing takes a while on your index because it basically has to copy 
the entire index to a new index, so it will take however long it takes 
to copy 2 GB on your hardware, plus a small amount of overhead...
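
Something along these lines, hypothetically (the path and analyzer are
made up, and this is untested):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BackgroundOptimize {
    public static void main(String[] args) throws Exception {
        final IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        // run optimize() off the request thread so the servlet
        // can send its response immediately
        new Thread(new Runnable() {
            public void run() {
                try {
                    writer.optimize();
                    writer.close();
                } catch (java.io.IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }
}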

Dan

Angelov, Rossen wrote:

I would like to bring this issue up again, as I haven't resolved it yet and
haven't found what's causing it.

Any help, ideas, or shared experience is welcome!

Thanks,
Ross

-Original Message-
From: Angelov, Rossen 
Sent: Friday, May 27, 2005 10:42 AM
To: 'java-user@lucene.apache.org'
Subject: how long should optimizing take


Hi,

I'm having problems with Lucene optimization. Two of the indexes are about
2 GB each, and every day about 30 documents are added to each of them. At
the end of the indexing, the IndexWriter optimize() method is executed, and
it takes about 30 minutes to finish the optimization for each index.

The indexing happens through a web service: a servlet takes an HTTP request
and executes methods to index the new documents and optimize the indexes.

The problem is that the request takes too long to finish because of the
optimization and the web server doesn't return a response. The browser will
keep waiting forever.

Has anybody else experienced similar behavior with the optimization
process?

Thanks,
Ross






RE: Indexing multiple languages

2005-06-02 Thread Tansley, Robert
Thanks all for the useful comments.

It seems that there are even more options --

4/ One index, with a separate Lucene document for each (item, language)
combination, with one field that specifies the language
5/ One index, one Lucene document per item, with field names that include
the language (e.g. title_en, title_cn)

I quite like 4, because you can search with no language constraint, or with
one, as Paul suggests below.  However, some non-language-specific data might
need to be repeated (e.g. dates), unless we had an extra Lucene document for
all of that.  I wonder what the various pros and cons in terms of index size
and performance would be in each case.  I really don't have enough knowledge
of Lucene to have any idea...
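
A minimal sketch of what 4/ might look like (the field names and values
are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Option4Sketch {
    // one Lucene document per (item, language) pair
    static void addItem(IndexWriter writer, String itemId,
                        String language, String title) throws Exception {
        Document doc = new Document();
        doc.add(Field.Keyword("item_id", itemId));    // shared item identifier
        doc.add(Field.Keyword("language", language)); // e.g. "en", "cn"
        doc.add(Field.Text("title", title));
        writer.addDocument(doc);
    }
    // then search "+title:newspapers +language:en", or drop the
    // language clause to search across all languages
}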

 Robert Tansley / Digital Media Systems Programme / HP Labs
  http://www.hpl.hp.com/personal/Robert_Tansley/

 -Original Message-
 From: Paul Libbrecht [mailto:[EMAIL PROTECTED] 
 Sent: 01 June 2005 04:10
 To: java-user@lucene.apache.org
 Subject: Re: Indexing multiple languages
 
 On 1 June 05, at 01:12, Erik Hatcher wrote:
  1/ one index for all languages
  2/ one index for all languages, with an extra language field so
  searches can be constrained to a particular language
  3/ separate indices for each language?
  I would vote for option #2 as it gives the most flexibility - you
  can query with or without concern for language.
 
 The way I've solved this is to make a different field name per
 language, as our documents can be multilingual. What's then done is
 query expansion at query time: given a term query for text, I
 duplicate it for each accepted language of the user, with a factor
 related to the preference of the language (e.g. the q factor in the
 Accept-Language HTTP header). Presumably I could use solution 2/ as
 well if my queries become too big, making several documents for each
 language of the document.
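 
 A minimal sketch of that expansion (the field names, sample word, and
 q values are all made up):
 
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TermQuery;
 
 public class LangExpandSketch {
     // duplicate the user's term across per-language fields,
     // boosted by each language's Accept-Language q factor
     static Query expand(String word) {
         BooleanQuery query = new BooleanQuery();
         TermQuery en = new TermQuery(new Term("text_en", word));
         en.setBoost(1.0f);            // q=1.0
         query.add(en, false, false);  // optional (SHOULD) clause
         TermQuery fr = new TermQuery(new Term("text_fr", word));
         fr.setBoost(0.8f);            // q=0.8
         query.add(fr, false, false);
         return query;
     }
 }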
 
 I think it's very important to care about guessing the accepted
 languages of the user. Typically, the default behaviour of Google is
 to only give you matches in your primary language, but then allow
 expansion into any language.
 
  On the other hand, if people are searching for proper nouns in
  metadata (e.g. DSpace) it may be advantageous to search all
  languages at once.
 
 This one may need particular treatment.
 
 Tell us your success!
 
 paul
 
 




RE: how long should optimizing take

2005-06-02 Thread Angelov, Rossen
I'll make sure no indexing is started before the optimization is done.
Most likely Sunday will be the optimization day for the indexes and every
other night the documents will be added to the index.

Only searching will be available through the web service while optimizing,
but this should not be a problem as an IndexReader will be opened, not a
second IndexWriter.

Ross

-Original Message-
From: Dan Armbrust [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 3:10 PM
To: java-user@lucene.apache.org
Subject: Re: how long should optimizing take


You should be careful, however, not to end up with two VM instances each 
trying to open an index writer at the same time - one of them is going 
to fail.

That is, if someone using your web interface tries to add a new document to 
the index while you have the optimizer running standalone, the web 
interface is not going to be able to get a lock on the index to add the 
documents.
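
One defensive option, sketched hypothetically (the path is made up, and
the check is racy, so treat it as a mitigation rather than a fix):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class LockCheckSketch {
    public static void main(String[] args) throws Exception {
        String path = "/path/to/index";
        // skip (or retry later) if the standalone optimizer holds the lock
        if (IndexReader.isLocked(path)) {
            System.out.println("index is locked; try again later");
            return;
        }
        IndexWriter writer =
            new IndexWriter(path, new StandardAnalyzer(), false);
        // ... add documents ...
        writer.close();
    }
}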

Dan

Angelov, Rossen wrote:

Thanks for the suggestion; Jian Chen's idea is very similar. Optimizing that
often is probably not necessary, and not that critical for speeding up the
searches.

I'll try changing the index process not to optimize at all, and execute the
optimization independently of the indexing on a weekly basis.

Ross

  

  





Re: Indexing and Hit Highlighting OCR Data

2005-06-02 Thread Chris Hostetter

This is a pretty interesting problem.  I envy you.

I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very different problem from highlighting in 2D
space.

Based on the XML sample you provided, it looks like your XML files
are already a tokenized form of the original OCR data -- by which I
mean the page has already been tokenized into words whose positions
are recorded.

I would parse these XML docs to generate two things:
1) a stream of words for analysis/filtering (i.e. stop words, stemming,
   synonyms)
2) a data structure mapping words to lists of positions (i.e. if the
   same word appears in multiple places, list the word once, followed
   by each set of coordinates)

Use #1 in the usual way, and add a serialized form of #2 to your index as
a stored Keyword field -- at query time, the words from your initial query
can be looked up in that data structure to find the regions to highlight.
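
Hypothetically, the indexing side might look like this (the field names
and the serialization format are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class OcrPageIndexer {
    static void addPage(IndexWriter writer, String words,
                        String positionMap) throws Exception {
        Document doc = new Document();
        // #1: the word stream, analyzed/filtered as usual
        doc.add(Field.Text("contents", words));
        // #2: the serialized word -> coordinate-list map, e.g.
        // "female:1664,123711,424,1914;eggs:3316,124223,284,1729"
        doc.add(Field.Keyword("positions", positionMap));
        writer.addDocument(doc);
    }
}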



: I am involved in a project which is trying to provide searching and hit
: highlighting on the scanned image of historical newspapers.  We have an
: XML based OCR format.  A sample is below.  We need to index the CONTENT
: attribute of the String element, which is the easy part.  We would like
: to be able to find the hits within this XML document in order to use the
: positioning information to draw the highlight boxes on the image.  It
: doesn't make a lot of sense to just extract the CONTENT and index that,
: because we lose the positioning information.  My second thought was to
: make a custom analyzer which dropped everything except for the content
: element, and then use the highlighting class in the sandbox to reanalyze
: the XML document and mark the hits.  With the marked hits in the XML we
: could find the position information and draw on the image.  Has anyone
: else worked with OCR information and Lucene?  What was your approach?
: Does this approach seem sound?  Any recommendations?
:
: Thanks, Corey
:
:  <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
:   <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0"
:    VPOS="123644.0" CONTENT="The" WC="1.0"/>
:   <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
:   <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0"
:    VPOS="123711.0" CONTENT="female" WC="1.0"/>
:   <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
:   <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0"
:    VPOS="123711.0" CONTENT="lays" WC="1.0"/>
:   <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
:   <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0" HPOS="2528.0"
:    VPOS="123711.0" CONTENT="about" WC="1.0"/>
:   <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
:   <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0" HPOS="3000.0"
:    VPOS="123770.0" CONTENT="140" WC="1.0"/>
:   <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
:   <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0"
:    VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
:   <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
:  </TextLine>
:
:
:



-Hoss





RE: Indexing multiple languages

2005-06-02 Thread Bob Cheung
Hi Erik,

I am a newcomer to this list, so please allow me to ask a dumb
question.

For the StandardAnalyzer, will it have to be modified to accept
different character encodings?

We have customers in China, Taiwan and Hong Kong.  Chinese data may come
in 3 different encodings: Big5, GB and UTF-8.

What is the default encoding for the StandardAnalyzer?

Btw, I did try running the Lucene demo (web template) to index the HTML
files after I added one containing both English and Chinese characters.
I was not able to search for any Chinese in that HTML file (it returned
no hits).  I wonder whether I need to change some of the Java programs
to index Chinese and/or accept Chinese as a search term.  I was able to
find the HTML file if I used an English word that appeared in it.
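
From what I can tell, the analyzer itself sees Java Strings, which are
already Unicode, so I suspect the conversion has to happen when the
files are read; something like this, perhaps (the file name, charset,
and field name are made up):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class EncodingSketch {
    static Document fromFile(String path, String charset) throws Exception {
        // convert bytes (Big5, GB2312, UTF-8, ...) to Unicode here,
        // before the text ever reaches the analyzer
        Reader reader =
            new InputStreamReader(new FileInputStream(path), charset);
        Document doc = new Document();
        doc.add(Field.Text("contents", reader));
        return doc;
    }
}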

Thanks,

Bob


On May 31, 2005, Erik wrote:

Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
will keep English as-is (removing stop words, lowercasing, and such)
and will also split CJK characters into individual tokens.
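
A quick way to see what it does (the field name and sample text are
made up):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("contents",
            new StringReader("Lucene is great 搜索引擎"));
        // prints "lucene", "great" ("is" is a stop word), then
        // one token per CJK character
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}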

 Erik


On May 31, 2005, at 5:49 PM, jian chen wrote:

 Hi,

 Interesting topic. I thought about this as well. I wanted to index
 Chinese text with English, i.e., I want to treat the English text
 inside Chinese text as English tokens rather than Chinese text tokens.

 Right now I think maybe I have to write a special analyzer that takes
 the text input and detects whether each character is an ASCII char: if
 it is, assemble the run of ASCII characters into a single token; if
 not, make the character a Chinese word token on its own.

 So, bottom line: just one analyzer for all the text, with that if/else
 logic inside the analyzer (a sketch follows below this message).

 I would like to learn more thoughts about this!

 Thanks,

 Jian
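
 A hypothetical sketch of that if/else tokenizer (the class name is made
 up, and only the basic CJK ideograph range U+4E00-U+9FFF is handled):

 import java.io.IOException;
 import java.io.Reader;
 import org.apache.lucene.analysis.Token;
 import org.apache.lucene.analysis.Tokenizer;

 public class AsciiCjkTokenizer extends Tokenizer {
     private int offset = 0;   // position in the input stream
     private int pending = -1; // one character of pushback

     public AsciiCjkTokenizer(Reader in) { super(in); }

     private int read() throws IOException {
         int c = (pending != -1) ? pending : input.read();
         pending = -1;
         if (c != -1) offset++;
         return c;
     }

     private boolean isAscii(int c) {
         return c < 128 && Character.isLetterOrDigit((char) c);
     }

     private boolean isCjk(int c) {
         return c >= 0x4E00 && c <= 0x9FFF;
     }

     public Token next() throws IOException {
         int c = read();
         while (c != -1 && !isAscii(c) && !isCjk(c)) c = read(); // skip separators
         if (c == -1) return null;
         int start = offset - 1;
         if (isCjk(c)) // each CJK character becomes its own token
             return new Token(String.valueOf((char) c), start, offset);
         StringBuffer sb = new StringBuffer(); // a run of ASCII becomes one token
         while (c != -1 && isAscii(c)) { sb.append((char) c); c = read(); }
         if (c != -1) { pending = c; offset--; } // push back the lookahead
         return new Token(sb.toString().toLowerCase(), start, offset);
     }
 }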

 On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote:

 Hi all,

 The DSpace (www.dspace.org) currently uses Lucene to index metadata
 (Dublin Core standard) and extracted full-text content of documents
 stored in it.  Now that the system is being used globally, it needs to
 support multi-language indexing.

 I've looked through the mailing list archives etc. and it seems it's
 easy to plug in analyzers for different languages.

 What if we're trying to index multiple languages in the same site?
 Is it best to have:

 1/ one index for all languages
 2/ one index for all languages, with an extra language field so
    searches can be constrained to a particular language
 3/ separate indices for each language?

 I don't fully understand the consequences in terms of performance for
 1/, but I can see that false hits could turn up where one word appears
 in different languages (stemming could increase the chances of this).
 Also some languages' analyzers are quite dramatically different (e.g.
 the Chinese one, which just treats every character as a separate
 token/word).

 On the other hand, if people are searching for proper nouns in
 metadata (e.g. DSpace) it may be advantageous to search all languages
 at once.


 I'm also not sure of the storage and performance consequences of 2/.

 Approach 3/ seems like it might be the most complex from an
 implementation/code point of view.

 Does anyone have any thoughts or recommendations on this?

 Many thanks,

  Robert Tansley / Digital Media Systems Programme / HP Labs
   http://www.hpl.hp.com/personal/Robert_Tansley/







