indexing pdfs

2007-03-08 Thread ashwin kumar
Hi, can someone help me by giving some sample programs for indexing PDFs and .doc files? Thanks and regards, Ashwin

RE: indexing pdfs

2007-03-08 Thread Kainth, Sachin
Hi Ashwin, You can try PDFBox to convert the PDF documents to text and then use Lucene to index the text. The code for turning a PDF into text is very simple: private static string parseUsingPDFBox(string filename) { // document reader PDDocument doc =
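
A minimal, self-contained sketch of the PDFBox extraction step described above (the class name PdfTextExtractor is illustrative, and package names differ between PDFBox releases; the 0.7.x versions of that era lived under org.pdfbox.*):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExtractor {
    // Load the PDF and return its plain text so it can be handed to Lucene.
    static String parseUsingPDFBox(String filename) throws Exception {
        PDDocument doc = PDDocument.load(new File(filename));
        try {
            return new PDFTextStripper().getText(doc);
        } finally {
            doc.close();
        }
    }
}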

Re: indexing pdfs

2007-03-08 Thread Ulf Dittmer
For DOC files you can use the Jakarta POI library. Text extraction is outlined here: http://jakarta.apache.org/poi/hwpf/quick-guide.html Ulf
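
A minimal sketch of what that quick guide covers, using POI's HWPF WordExtractor (the class name and filename handling here are illustrative; WordExtractor ships in POI's scratchpad jar):

import java.io.FileInputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class DocTextExtractor {
    // Return the plain text of a .doc file so it can be indexed like any other string.
    static String extract(String filename) throws Exception {
        WordExtractor extractor = new WordExtractor(new FileInputStream(filename));
        return extractor.getText();
    }
}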

Re: indexing pdfs

2007-03-08 Thread ashwin kumar
Is the only way to index PDFs to convert them to text and then index that text? On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote: Hi Ashwin, You can try PDFBox to convert the PDF documents to text and then use Lucene to index the text.

RE: indexing pdfs

2007-03-08 Thread Kainth, Sachin
Well, you don't need to actually save the text to disk and then index the saved text file; you can index that text directly in memory. The only other way I have heard of is to use IFilters. I believe SeekAFile does indexing of PDFs. Sachin
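
A minimal sketch of indexing the extracted text directly, with nothing written to disk except the index itself. It reuses the hypothetical parseUsingPDFBox helper sketched earlier in the thread; the field names, the "index" directory, and the constructor shown are illustrative Lucene 2.x-era usage:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class InMemoryTextIndexer {
    static void indexPdf(String pdfPath) throws Exception {
        String text = PdfTextExtractor.parseUsingPDFBox(pdfPath); // extracted text stays in memory
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("path", pdfPath, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}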

Re: indexing pdfs

2007-03-08 Thread ashwin kumar
Hi again, do we have to download any jar files to run this program? If so, can you give me the link please? Ashwin On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote: Well, you don't need to actually save the text to disk and then index the saved text file; you can index that text directly in memory.

Re: Lucene Ranking/scoring

2007-03-08 Thread Peter Keegan
I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be used to rank documents by score and date (solr.search.function contains great stuff!). The values in the date field that are used for the ValueSource are not actually used as 'floats', but rather their ordinal term values

RE: indexing pdfs

2007-03-08 Thread Kainth, Sachin
Hi, Here it is: http://www.seekafile.org/

Index a source, but not store it... can it be done?

2007-03-08 Thread Walt Stoneburner
Have an interesting scenario I'd like to get your take on with respect to Lucene: A data provider (e.g. someone with a private website or corporately shared directory of proprietary documents) has requested their content be indexed with Lucene so employees can be redirected to it, but

Lucene 2.1, inconsistent phrase query results with slop

2007-03-08 Thread Erick Erickson
In a nutshell, reversing the order of the terms in a phrase query can result in different hit counts. That is, "person place"~3 may return different results from "place person"~3, depending on the number of intervening terms. There's a self-contained program below that illustrates what I'm seeing,

Re: Lucene 2.1, inconsistent phrase query results with slop

2007-03-08 Thread Yonik Seeley
On 3/8/07, Erick Erickson [EMAIL PROTECTED] wrote: In a nutshell, reversing the order of the terms in a phrase query can result in different hit counts. That is, "person place"~3 may return different results from "place person"~3, depending on the number of intervening terms. I think that's

Re: Lucene 2.1, inconsistent phrase query results with slop

2007-03-08 Thread Chris Hostetter
: I think that's working as designed. Although I could understand : someone wanting it to work differently. The slop is sort of like the : edit distance from the current given phrase, hence the order of terms : in the phrase matters. correct ... LIA has a great diagram explaining this ... the
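
To make the asymmetry concrete, here is a minimal sketch of the two queries as PhraseQuery objects (the field name and terms are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SloppyPhraseExample {
    public static void main(String[] args) {
        // "person place"~3
        PhraseQuery forward = new PhraseQuery();
        forward.add(new Term("contents", "person"));
        forward.add(new Term("contents", "place"));
        forward.setSlop(3);

        // "place person"~3: the edit distance from a document's actual token order
        // changes when the terms are reversed, so the two can match different documents.
        PhraseQuery reversed = new PhraseQuery();
        reversed.add(new Term("contents", "place"));
        reversed.add(new Term("contents", "person"));
        reversed.setSlop(3);
    }
}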

Multiple segments

2007-03-08 Thread Kainth, Sachin
Hi all, I have been performing some tests on index segments and have a problem. I have read the file formats document on the official website and from what I can see it should be possible to create as many segments for an index as there are documents (though of course this is not a great idea).

Plural word search

2007-03-08 Thread Tony Qian
All, I'm evaluating Lucene as a full-text search engine for a project. One of the requirements is the following: 4) Plural Literal Search: if you use the plural of a term such as bears, the results will include matches to the plural term bears as well as the singular term bear. It seems to

Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Doron Cohen
Token positions are also used for phrase search. You could probably compromise on this by setting all token positions to 0 - this would make a document appear as a *set* of words (rather than a *list*). An adversary would be able to know/guess what words are in each document (and, with (API)
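
A minimal sketch of the position-flattening idea, written against the Lucene 2.x-era TokenStream API; the class name is illustrative:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class FlattenPositionsFilter extends TokenFilter {
    private boolean first = true;

    public FlattenPositionsFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        // Every token after the first gets position increment 0, so all terms
        // stack on one position and the document becomes a set rather than a list.
        t.setPositionIncrement(first ? 1 : 0);
        first = false;
        return t;
    }
}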

RE: Plural word search

2007-03-08 Thread Kainth, Sachin
Hi Tony, Lucene certainly does support it. It just requires you to use a tokeniser that performs stemming, such as any analyzer that uses PorterStemFilter. Sachin
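
A minimal sketch of such an analyzer using the 2.x-era API (the class name is illustrative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // "bears" and "bear" both stem to "bear", so a search for either matches both,
        // as long as the same analyzer is used at index time and query time.
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}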

RE: Plural word search

2007-03-08 Thread Tony Qian
Sachin, Thanks for the quick response. Is there any code example I can take a look at? I'm not familiar with the technique you mentioned. My question is how the analyzer knows buss is not a plural and bears is a plural. Lucene supports wildcards. However, we cannot use a wildcard at the beginning of

Re: Multiple segments

2007-03-08 Thread Doron Cohen
maxMergeDocs only limits the merging of already saved segments as a result of calling addDocument(). If there are added documents not yet saved but rather still buffered in memory (by IndexWriter), once their number exceeds maxBufferedDocs they are saved, but as a single merged segment. So you could
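
A minimal sketch of the knobs involved, using the Lucene 2.x-era IndexWriter API; the directory name and values are illustrative only:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SegmentTuningExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(10);  // buffered docs are flushed as one new segment every 10 adds
        writer.setMergeFactor(10);      // saved segments are merged once 10 of similar size accumulate
        writer.setMaxMergeDocs(100000); // a merge never produces a segment larger than this many docs
        writer.close();
    }
}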

Re: Term Frequency within Hits

2007-03-08 Thread Chiradeep Vittal
Term Frequency in Lucene parlance = number of occurrences of the term within a single document. If you're looking for how many documents have term x where x is unknown, see SimpleFacets in Solr http://lucene.apache.org/solr/api/org/apache/solr/request/SimpleFacets.html
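
A minimal sketch of reading that per-document frequency with the 2.x-era API; the index path, field, and term are placeholders:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermFreqExample {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index");
        TermDocs td = reader.termDocs(new Term("contents", "lucene"));
        while (td.next()) {
            // freq() is the term frequency: occurrences of the term within that one document
            System.out.println("doc " + td.doc() + ": freq=" + td.freq());
        }
        td.close();
        reader.close();
    }
}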

RE: Plural word search

2007-03-08 Thread Chris Hostetter
: Thanks for the quick response. Is there any code example I can take a look at? : I'm not familiar with the technique you mentioned. My question is how the : analyzer knows buss is not a plural and bears is a plural. Stemming is a vast topic of text analysis .. some stemmers work using dictionaries,

Re: Lucene 2.1, inconsistent phrase query results with slop

2007-03-08 Thread Erick Erickson
Sorry about that. I think I found the diagram you're talking about on page 89. It even addresses the exact problem I'm talking about. It's not the first time I've looked like a fool, you'd think I'd be getting used to it by now G. So, it seems like the most reasonable solution to this issue

Re: Plural word search

2007-03-08 Thread Erick Erickson
as of 2.1, as I remember, you can use leading wildcards but ONLY if you set a flag (see setAllowLeadingWildcard in QueryParser). Be aware of the TooManyClauses issue though (search the mail archive and you'll find many discussions of this issue). Erick
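
A minimal sketch of turning the flag on (the field name and query string are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class LeadingWildcardExample {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        parser.setAllowLeadingWildcard(true); // off by default
        // A leading wildcard expands against every term in the field, so it can be slow
        // and can still throw BooleanQuery.TooManyClauses on a large index.
        Query q = parser.parse("*ears");
        System.out.println(q);
    }
}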

Re: Plural word search

2007-03-08 Thread Tony Qian
Erick, thanks for the information. Tony

Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Jason Pump
If you store a hash code of the word rather than the actual word, you should be able to search for stuff but not be able to actually retrieve it; you can trade precision for security based on the number of bits in the hash code (e.g. 32 or 64 bits). I'd think a 64 bit hash would be a

Re: A solution to HitCollector-based searches problems

2007-03-08 Thread oramas martín
Hello, I have just added some search implementation samples based on this collector solution, to ease the use and understanding of it: - KeywordSearch: Extract the terms (and frequency) found in a list of fields from the results of a query/filter search -

Re: Negative Filtering (such as for profanity)

2007-03-08 Thread Grant Ingersoll
I _think_ Lucene 2.1 (or is it trunk? I lose track) has the ability to delete all documents containing a term. So, every time you update your profanity list, you could iterate over it and remove all documents that contain the terms. If a user can never get these documents via a query,
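
One way to do the purge is delete-by-term on an IndexReader via deleteDocuments(Term); a minimal sketch, with the index path, field, and term as placeholders:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class ProfanityPurgeExample {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index");
        int deleted = reader.deleteDocuments(new Term("contents", "someBadWord"));
        System.out.println("deleted " + deleted + " documents");
        reader.close(); // closing the reader commits the deletions
    }
}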

one Field in many documents

2007-03-08 Thread new333333
Hi, I have to index many documents with the same fields (only one or two fields are different). Can I add a field (Field instance) to many documents? It seems to work but I'm not sure if this is the right way... Thank you

Re: one Field in many documents

2007-03-08 Thread Doron Cohen
In general I would say this is not safe, because it seems to assume too much about the implementation - and while it might in most cases currently work, the implementation could change and the program assuming this would stop working. It would most probably not work correctly right from the start

Re: one Field in many documents

2007-03-08 Thread Michael D. Curtin
[EMAIL PROTECTED] wrote on 08/03/2007 12:56:33: I have to index many documents with the same fields (only one or two fields are different). Can I add a field (Field instance) to many documents? It seems to work but I'm not sure if this is the right way... What does many mean in this context?

Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Chris Hostetter
: If you store a hash code of the word rather than the actual word you : should be able to search for stuff but not be able to actually retrieve that's a really great solution ... it could even be implemented as a TokenFilter so none of your client code would ever even need to know that it was
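
A minimal sketch of that TokenFilter, written against the 2.x-era API; the hash choice and class name are illustrative. The same filter has to run at query time too, so wrapping it in the analyzer hides the hashing from client code entirely:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class HashingTokenFilter extends TokenFilter {
    public HashingTokenFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        // Replace the term text with a hex hash so the index never contains the word itself.
        String hashed = Integer.toHexString(t.termText().hashCode());
        return new Token(hashed, t.startOffset(), t.endOffset());
    }
}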

Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Mike Klaas
On 3/8/07, Chris Hostetter [EMAIL PROTECTED] wrote: : If you store a hash code of the word rather than the actual word you : should be able to search for stuff but not be able to actually retrieve that's a really great solution ... it could even be implemented as a TokenFilter so none of your

Re: indexing pdfs

2007-03-08 Thread ashwin kumar
Hi Sachin, the link you gave me has only a zip file and an exe file for download, and the zip file contains no class files. Wouldn't we need a jar file or class file? On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote: Hi, Here it is: http://www.seekafile.org/

Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Chris Hostetter
: I don't know... hashing individual words is an extremely weak form of : security that should be breakable without even using a computer... all : the statistical information is still there (somewhat like 'encrypting' : a message as a cryptoquote). : : Doron's suggestion is preferable: eliminate

Re: Lucene Ranking/scoring

2007-03-08 Thread Chris Hostetter
: Do I have this right? I got a bit confused at first because I assumed that the : actual field values were being used in the computation, but you really need : to know the unique term count in order to get the score 'right'. you can use the actual values in FunctionQueries, except that: 1)

Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Mike Klaas
On 3/8/07, Chris Hostetter [EMAIL PROTECTED] wrote: if the issue is that you want to be able to ship an index that people can manipulate as much as they want and you want to guarantee they can never reconstruct the original docs, you're pretty much screwed ... even if you eliminate all of the

FieldCache: flush cache explicitly

2007-03-08 Thread John Wang
I think the api should allow for explicitly flush the fieldcache. I have a setup where new readers are being loaded very some period of time. I don't want to rely on Java WeakHashMap to free the cache, I want to be able to do it in a deterministic way. It would be great if this can be added to