Extracting data from Lucene index files

2006-12-13 Thread Venkateshprasanna
I would like to take the data stored in the Lucene indexes, such as the words and their frequencies, and store it in a database. Can anyone suggest a way of going about it, or tell me whether it is possible at all? TIA, Prasanna

Re: Extracting data from Lucene index files

2006-12-13 Thread Grant Ingersoll
Take a look at TermDocs and TermEnum. -Grant On Dec 13, 2006, at 6:02 AM, Venkateshprasanna wrote: I would like to use the data stored in the Lucene indexes, like the words and their frequencies and store them in a database. Can anyone suggest a way of going about it or is it possible

Indexing clarification , please advice

2006-12-13 Thread abdul aleem
Hello All, Apologies if this is a naive question. a) Indexing a large file (more than 4MB): Do I need to read the entire file as a string using java.io and create a Document object? The file contains timestamps; if I need to index on timestamp, is parsing the entire file manually

Re: Indexing clarification , please advice

2006-12-13 Thread Erick Erickson
Let me take a crack at it. See below... On 12/13/06, abdul aleem [EMAIL PROTECTED] wrote: Hello All, Apologies if this is a naive question. a) Indexing a large file (more than 4MB): Do I need to read the entire file as a string using java.io and create a Document object? Essentially yes.

Re: Indexing clarification , please advice

2006-12-13 Thread abdul aleem
Many thanks, Erick. Your points are valid. I was thinking of the entire log file as one Lucene document; I was wrong, and chopping the log file up might be the way to go. Sorry for expressing myself badly: yes, you got that right, the timestamp must be added as a FIELD, which is what I meant. I really appreciate your detailed reply,
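As an illustration of the "chop the log file" idea in this thread, here is a minimal plain-Java sketch that reads a log through a Reader (so the whole file never has to be held as one string) and splits it into entries, each with a separate timestamp. The line format "yyyy-MM-dd HH:mm:ss message", the class name LogChopper and its method names are assumptions made for the example, not anything stated in the thread. Each resulting entry could then become one Lucene document with the timestamp in its own field; that Lucene step is deliberately left out here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a log stream into (timestamp, message) entries. */
final class LogChopper {
    // Assumed line format (not from the thread): "2006-12-13 14:10:00 message text"
    private static final Pattern ENTRY_START =
            Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) (.*)$");

    /** Returns one String[] per entry: [0] = timestamp, [1] = message. */
    static List<String[]> chop(Reader source) throws IOException {
        List<String[]> entries = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = ENTRY_START.matcher(line);
            if (m.matches()) {
                // A line that starts with a timestamp opens a new entry.
                entries.add(new String[] { m.group(1), m.group(2) });
            } else if (!entries.isEmpty()) {
                // Any other line (e.g. a stack trace) continues the previous entry.
                String[] last = entries.get(entries.size() - 1);
                last[1] = last[1] + "\n" + line;
            }
            // Lines before the first timestamped line are ignored.
        }
        return entries;
    }

    /** Convenience wrapper for in-memory text. */
    static List<String[]> chopText(String text) {
        try {
            return chop(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a real log, chop would be handed a FileReader, and the pattern would need adjusting to the actual timestamp layout.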

Re: Lucene scoring: coord_q_d factor

2006-12-13 Thread Karl Koch
Do you know about any papers that discuss this? Karl Original Message Date: Wed, 13 Dec 2006 10:31:41 -0500 From: Yonik Seeley [EMAIL PROTECTED] To: java-user@lucene.apache.org Subject: Re: Lucene scoring: coord_q_d factor On 12/13/06, Karl Koch [EMAIL PROTECTED] wrote:

RE: de-boosting fields

2006-12-13 Thread Scott Smith
One other thing I discovered that I mention so no one else is tripped up by it. I set the boost to zero for the categories in the query. When I ran my unit tests, some of them started to fail. I eventually realized that the failures were in searches where I only wanted to find documents in

Problems with Queries which contain '_' and wildcards

2006-12-13 Thread Stefan Schütz
Hi, first let me explain the situation: we have to index a document that contains a field, file, which stores filenames. Sometimes filenames contain an underscore or a minus (_ or -), e.g. foo_bar.doc. Indexing isn't the problem so far. But if we now try to search for foo_b* the

Lucene LSA

2006-12-13 Thread mariolone
Hi, I have a problem: I must create a term-by-document matrix in which every element represents the number of occurrences of that term in the document. How can I do this? Can someone help me? Thanks to all. P.S. I must apply LSA to this matrix.
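To make the shape of such a matrix concrete, here is a minimal plain-Java sketch that builds term-by-document counts directly from raw text. It does not call Lucene at all; the class name TermDocumentMatrix and the simple lowercase, split-on-non-alphanumerics tokenizer are assumptions made for the example. With a real index, the same counts would more likely be read from the index itself (TermEnum and TermDocs are suggested for that elsewhere on this page).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of a term-by-document count matrix (no Lucene calls). */
final class TermDocumentMatrix {
    final List<String> terms; // row labels, in sorted order
    final int[][] counts;     // counts[termRow][documentIndex]

    private TermDocumentMatrix(List<String> terms, int[][] counts) {
        this.terms = terms;
        this.counts = counts;
    }

    /** Counts how often each term occurs in each document. */
    static TermDocumentMatrix build(String[] documents) {
        final int docCount = documents.length;
        // TreeMap keeps the terms sorted, so the row order is deterministic.
        Map<String, int[]> rows = new TreeMap<>();
        for (int d = 0; d < docCount; d++) {
            for (String token : tokenize(documents[d])) {
                rows.computeIfAbsent(token, t -> new int[docCount])[d]++;
            }
        }
        List<String> terms = new ArrayList<>(rows.keySet());
        int[][] counts = new int[terms.size()][];
        for (int r = 0; r < terms.size(); r++) {
            counts[r] = rows.get(terms.get(r));
        }
        return new TermDocumentMatrix(terms, counts);
    }

    /** Assumed tokenizer: lowercase, split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Occurrences of a term in one document; 0 if the term is unknown. */
    int count(String term, int documentIndex) {
        int row = terms.indexOf(term);
        return row < 0 ? 0 : counts[row][documentIndex];
    }
}
```

The counts array is dense, which is fine for a small demonstration; for a large collection a sparse representation would be the more realistic input to an LSA step.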

Re: Problems with Queries which contain '_' and wildcards

2006-12-13 Thread Ronnie Kolehmainen
I recognize that error message ;) You're using AnalyzingQueryParser http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html - yes? These are imo the two most obvious options: 1. Revert to standard QueryParser - it won't analyze prefix- and

Re: lucene functionality

2006-12-13 Thread Patrick Turcotte
I would suggest you take a look at exist-db (http://exist-db.org/), a database for XML documents that supports XQuery. We are using both products here (lucene and exist-db), and for what you are looking for, exist-db seems better. Our documents are far more complex than yours (about 500

Re: lucene functionality

2006-12-13 Thread Marcelo Ochoa
Hi Mark: For 10 million records we recommend a strong database such as Oracle. You can annotate the schema (.xsd) which describes your XML record to store some fields in traditional VARCHAR2 or NUMBER columns to query them faster, and DRECONTENT in a CLOB column. You can find more information

Re: lucene functionality

2006-12-13 Thread Doron Cohen
Lucene RangeQuery would do for the time and numeric reqs. Mark Mei [EMAIL PROTECTED] wrote: At the bottom of this email is the sample xml file that we are using today. We have about 10 million of these. We need to know whether Lucene can support the following functionalities. (1) Each field
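One point to check against the Lucene version in use: as far as I know, RangeQuery in Lucene releases of this period compares terms as strings, so numeric values need an encoding whose string order matches their numeric order. The plain-Java sketch below shows the usual zero-padding idea for non-negative numbers; the class name SortableNumbers and the fixed width are assumptions for the example, and nothing in it is Lucene API.

```java
/** Hypothetical helper: encodes non-negative numbers so string order equals numeric order. */
final class SortableNumbers {
    /** Left-pads a non-negative value with zeros to a fixed width. */
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in width " + width);
        }
        StringBuilder padded = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }
}
```

Unpadded, "9" sorts after "10" as a string; padded to the same width, "000009" sorts before "000010", which is the ordering a string-based range comparison needs. The same width would have to be used at index time and at query time.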

Re: Indexing clarification , please advice

2006-12-13 Thread Daniel Naber
On Wednesday 13 December 2006 14:10, abdul aleem wrote: a) Indexing large file (more than 4MB) Do i need to read the entire file as string using java.io and create a Document object? You can also use a reader:

Re: Advice on 3NF Data Structures and Lucene Please

2006-12-13 Thread Chris Lu
You are right. A database is usually in 3NF, while Lucene usually works on an array of objects. Different databases have different data models. There have been quite a few efforts to crawl a database, create the Lucene index, keep it in sync with the database, and render the search results. If data model

Re: Lucene scoring: coord_q_d factor

2006-12-13 Thread Paul Elschot
On Wednesday 13 December 2006 16:42, Karl Koch wrote: Do you know about any papers that discuss this? Coordination is called co-ordination in the original idf paper by K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation

Re: lucene functionality

2006-12-13 Thread Chris Hostetter
: For 10 million records we recommend a strong database such as Oracle. eh ... who is "We" in that statement? I suspect you'll find other people on this list who have no problems running Lucene indexes containing 10 million documents. If you want a database, then by all means use a database,

Re: lucene functionality

2006-12-13 Thread Marcelo Ochoa
Hi Chris: On 12/13/06, Chris Hostetter [EMAIL PROTECTED] wrote: : For 10 million records we recommend a strong database such as Oracle. eh ... who is "We" in that statement? We are independent consultants working for many years with Oracle databases ;) I suspect you'll find other people

Index Excel File

2006-12-13 Thread spinergywmy
Hi, Has anyone indexed an Excel file before? I took a look at the API classes provided by POI HSSF; however, I did not find any method to extract the text from an Excel file and index it. Please assist and let me know where I can find an example to refer to. Thanks and regards, Wooi Meng

Re: Advice on 3NF Data Structures and Lucene Please

2006-12-13 Thread Chris Lu
I think the last structure is good. The index should be structured according to how you want to search it. If your needs change, you should simply build another index. One index for everything is not really good. An index is more a matter of trading space for time, so duplication is not really a concern. The first

Announcing: IBM OmniFind Yahoo! Edition

2006-12-13 Thread Andreas Neumann
As you may have already heard, IBM and Yahoo! today released a new product named IBM OmniFind Yahoo! Edition (http://omnifind.ibm.yahoo.net/productinfo.php). It is a free-of-charge search engine for web sites and file systems, which builds on Lucene and other components such as UIMA