Hi, can someone help me by giving some sample programs for indexing PDFs and
.doc files?
thanks
regards
ashwin
Hi Ashwin,
You can try PDFBox to convert the PDF documents to text and then use
Lucene to index the text. The code for turning a PDF into text is very
simple:
private static String parseUsingPDFBox(String filename) throws IOException {
    // load the PDF and extract its text with PDFBox
    PDDocument doc = PDDocument.load(new File(filename));
    try { return new PDFTextStripper().getText(doc); }
    finally { doc.close(); }
}
For DOC files you can use the Jakarta POI library. Text extraction is
outlined here: http://jakarta.apache.org/poi/hwpf/quick-guide.html
Ulf
On 08.03.2007, at 10:37, ashwin kumar wrote:
Hi, can someone help me by giving some sample programs for indexing
PDFs and .doc files?
Is the only way to index PDFs to convert them to text first and then index
that?
On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote:
Hi Ashwin,
You can try PDFBox to convert the PDF documents to text and then use
Lucene to index the text. The code for turning a PDF into text is very
simple:
Well, you don't need to actually save the text to disk and then index the
saved file; you can index that text directly in memory.
The only other way I have heard of is to use IFilters. I believe
SeekAFile does indexing of PDFs.
Sachin
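The in-memory approach mentioned above can be sketched in a few lines (illustrative Python standing in for Lucene's Document/IndexWriter, not the Lucene API): the extracted text goes straight from the parser into an inverted index, with no intermediate file on disk.

```python
# Toy in-memory inverted index: extracted text goes straight from the
# parser into the index, with no intermediate file written to disk.
def build_index(docs):
    inverted = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inverted.setdefault(term, set()).add(doc_id)
    return inverted

# Pretend these strings came out of PDFBox/POI text extraction.
docs = {"a.pdf": "Lucene in Action", "b.doc": "indexing PDF files with Lucene"}
index = build_index(docs)
print(sorted(index["lucene"]))  # -> ['a.pdf', 'b.doc']
```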
-Original Message-
From: ashwin kumar
hi again
Do we have to download any jar files to run this program? If so, can you give
me the link, please?
ashwin
On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote:
Well, you don't need to actually save the text to disk and then index the
saved file; you can index that text directly in memory.
I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be
used to rank documents by score and date (solr.search.function contains
great stuff!). The values in the date field that are used for the
ValueSource are not actually used as 'floats', but rather their ordinal term
values.
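The arithmetic behind this combination can be sketched numerically (illustrative Python; the m/a/b values below are arbitrary examples, not recommended settings). Solr's ReciprocalFloatFunction computes a/(m*x+b); feeding it the reverse ordinal of the date field (newest document -> 1, oldest -> N) yields a boost that decays smoothly with document age.

```python
# Sketch of ReciprocalFloatFunction: recip(x, m, a, b) = a / (m*x + b).
# With x = reverse ordinal of the date field, newer docs get larger boosts.
def recip(x, m, a, b):
    return a / (m * x + b)

# Reverse ordinals for five documents, newest first: boost shrinks with age.
for rord in [1, 2, 3, 4, 5]:
    print(rord, round(recip(rord, 1.0, 1000.0, 1000.0), 4))
```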
Hi,
Here it is:
http://www.seekafile.org/
-Original Message-
From: ashwin kumar [mailto:[EMAIL PROTECTED]
Sent: 08 March 2007 13:07
To: java-user@lucene.apache.org
Subject: Re: indexing pdfs
hi again
do we have to download any jar files to run this program if so can u
give me the
Have an interesting scenario I'd like to get your take on with respect
to Lucene:
A data provider (e.g. someone with a private website or corporately
shared directory of proprietary documents) has requested their content
be indexed with Lucene so employees can be redirected to it, but
In a nutshell, reversing the order of the terms in a phrase query can
result in different hit counts. That is, "person place"~3 may return
different results from "place person"~3, depending on the number
of intervening terms.
There's a self-contained program below that
illustrates what I'm seeing,
On 3/8/07, Erick Erickson [EMAIL PROTECTED] wrote:
In a nutshell, reversing the order of the terms in a phrase query can
result in different hit counts. That is, "person place"~3 may return
different results from "place person"~3, depending on the number
of intervening terms.
I think that's
: I think that's working as designed. Although I could understand
: someone wanting it to work differently. The slop is sort of like the
: edit distance from the current given phrase, hence the order of terms
: in the phrase matters.
correct ... LIA has a great diagram explaining this ... the
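A toy model of the two-term case makes the asymmetry concrete (a sketch of the edit-distance idea described above, not Lucene's actual matching code): terms already in query order need slop equal to the number of intervening terms, while reversed terms need two extra moves to cross past each other.

```python
# Minimal model of sloppy-phrase cost for a two-term phrase query.
def min_slop(doc_tokens, first, second):
    p1, p2 = doc_tokens.index(first), doc_tokens.index(second)
    if p1 < p2:
        return p2 - p1 - 1  # terms already in query order: count the gap
    return p1 - p2 + 1      # reversed terms must also swap past each other

doc = ["person", "visited", "the", "place"]
print(min_slop(doc, "person", "place"))  # "person place"~2 matches this doc
print(min_slop(doc, "place", "person"))  # "place person" needs slop 4
```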
Hi all,
I have been performing some tests on index segments and have a problem.
I have read the file formats document on the official website and from
what I can see it should be possible to create as many segments for an
index as there are documents (though of course this is not a great
idea).
All,
I'm evaluating Lucene as a full-text search engine for a project. I got one
of the requirements as following:
4) Plural Literal Search
If you use the plural of a term such as bears the results will include
matches to the plural term bears as well as the singular term bear.
it seems to
Token positions are also used for phrase search.
You could probably compromise here by setting all token positions to 0 -
this would make a document appear as a *set* of words (rather than a
*list*). An adversary would still be able to know or guess what words are in
each document.
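A quick sketch of why zeroing positions works (illustrative Python, not Lucene code): with real positions a phrase query can check adjacency, but with every position forced to 0 the index only records membership, so word order cannot be reconstructed.

```python
# With distinct positions, a phrase "a b" matches when pos[b] == pos[a] + 1.
# Flattening every position to 0 reduces the document to a bag of words.
tokens = ["new", "york", "times"]
positional = {t: i for i, t in enumerate(tokens)}  # real positions
flattened = {t: 0 for t in tokens}                 # all positions zeroed

def phrase_match(index, a, b):
    return index[b] - index[a] == 1

print(phrase_match(positional, "new", "york"))  # True: adjacency preserved
print(phrase_match(flattened, "new", "york"))   # False: order is lost
```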
Hi Tony,
Lucene certainly does support it. It just requires you to use a
tokeniser that performs stemming such as any analyzer that uses
PorterStemFilter.
Sachin
-Original Message-
From: Tony Qian [mailto:[EMAIL PROTECTED]
Sent: 08 March 2007 16:52
To: java-user@lucene.apache.org
Sachin,
Thanks for the quick response. Is there any code example I can take a look
at? I'm not familiar with the technique you mentioned. My question is how the
analyzer knows buss is not a plural and bears is a plural.
Lucene supports wildcards. However, we cannot use a wildcard at the beginning
of a term by default.
maxMergeDocs only limits the merging of already saved segments resulting from
calls to addDocument(). If there are added documents not yet saved but
rather still buffered in memory (by IndexWriter), once their number exceeds
maxBufferedDocs they are saved, but as a single merged segment. So you
could
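The buffering behaviour described above can be simulated (a toy model of IndexWriter's flushing with made-up numbers, not its real implementation): every time the in-memory buffer reaches maxBufferedDocs, the buffered documents are written out together as one new segment.

```python
# Toy simulation: each flush of the in-memory buffer produces a single
# segment containing all of the buffered documents.
def simulate_segments(num_docs, max_buffered):
    segments, buffered = [], 0
    for _ in range(num_docs):
        buffered += 1
        if buffered == max_buffered:  # buffer full: flush as one segment
            segments.append(buffered)
            buffered = 0
    if buffered:                      # final partial flush on close
        segments.append(buffered)
    return segments

print(simulate_segments(10, 3))  # -> [3, 3, 3, 1]
```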
Term frequency in Lucene parlance = number of occurrences of the term within a
single document.
If you're looking for how many documents have term x where x is unknown, see
SimpleFacets in Solr
http://lucene.apache.org/solr/api/org/apache/solr/request/SimpleFacets.html
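The distinction above can be shown in a few lines (illustrative Python, not Lucene/Solr code): term frequency is a per-document count, while the faceting-style question "how many documents contain x" is document frequency.

```python
# tf: occurrences of a term within one document.
# df: number of documents containing the term at least once.
docs = [["cat", "cat", "dog"], ["dog", "bird"], ["cat"]]

def tf(term, doc):
    return doc.count(term)

def df(term):
    return sum(1 for d in docs if term in d)

print(tf("cat", docs[0]))  # 2: "cat" occurs twice in the first document
print(df("cat"))           # 2: two documents contain "cat"
```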
- Original Message
: Thanks for the quick response. Is there any code example I can take a look
: at? I'm not familiar with the technique you mentioned. My question is how
: the analyzer knows buss is not a plural and bears is a plural.
Stemming is a vast topic of text analysis ... some stemmers work using
dictionaries,
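Rule-based stemmers like the Porter stemmer answer the "buss" vs "bears" question mechanically. Its first rule group (step 1a, sketched here in Python; the real PorterStemFilter applies many more steps) strips a final "s" only when the word does not end in "ss":

```python
# Sketch of Porter stemmer step 1a: sses -> ss, ies -> i, ss -> ss, s -> "".
def step1a(word):
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # buss stays buss: not treated as a plural
    if word.endswith("s"):
        return word[:-1]  # bears -> bear
    return word

print(step1a("bears"), step1a("buss"), step1a("caresses"))
```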
Sorry about that. I think I found the diagram you're talking about on page
89.
It even addresses the exact problem I'm talking about.
It's not the first time I've looked like a fool; you'd think I'd be getting
used to it by now <g>.
So, it seems like the most reasonable solution to this issue
As of 2.1, as I remember, you can use leading wildcards but ONLY if
you set a flag (see setAllowLeadingWildcard in QueryParser). Be
aware of the TooManyClauses issue, though (search the mail
archive and you'll find many discussions of this issue).
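The reason leading wildcards are off by default, and why TooManyClauses looms, can be sketched as follows (illustrative Python, not Lucene internals): terms are stored sorted, so a trailing wildcard is a cheap range lookup, while a leading wildcard must scan every term in the dictionary, and each matched term becomes one clause of the rewritten query.

```python
import bisect
import fnmatch

terms = sorted(["apple", "apply", "banana", "grape", "pineapple"])

def prefix_terms(prefix):
    # Sorted term dictionary: a trailing wildcard is a binary-searched range.
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

def leading_wildcard_terms(pattern):
    # A leading wildcard forces a linear scan of the whole dictionary;
    # every matching term expands into a clause, hence TooManyClauses.
    return [t for t in terms if fnmatch.fnmatch(t, pattern)]

print(prefix_terms("appl"))              # ['apple', 'apply']
print(leading_wildcard_terms("*apple"))  # ['apple', 'pineapple']
```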
Erick
On 3/8/07, Tony Qian [EMAIL PROTECTED] wrote:
Erick,
thanks for information.
Tony
From: Erick Erickson [EMAIL PROTECTED]
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: Plural word search
Date: Thu, 8 Mar 2007 13:42:00 -0500
As of 2.1, as I remember, you can use leading wildcards but ONLY if
you set a
If you store a hash code of the word rather than the actual word, you
should be able to search for stuff but not be able to actually retrieve
it; you can trade precision for security based on the number of bits
in the hash code (e.g. 32 or 64 bits). I'd think a 64-bit hash would be
a
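A sketch of the hashed-term idea (illustrative Python, not Lucene code): index a truncated hash of each word instead of the word itself, so queries still match but the stored terms do not reveal the vocabulary directly. The hash width trades precision (collisions) against how hard the terms are to invert.

```python
import hashlib

def hash_token(word, bits=64):
    # Truncate a SHA-256 digest to the desired width; fewer bits means
    # more collisions (less precision) but the index reveals less.
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    return digest[: bits // 4]  # 4 bits per hex character

# Index hashed terms instead of the words themselves.
indexed = {hash_token(w) for w in "the quarterly numbers look strong".split()}

def matches(query_word):
    # Hash the query term the same way at search time.
    return hash_token(query_word) in indexed

print(matches("quarterly"), matches("weak"))  # True False
```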
Hello,
I have just added some search implementation samples based on this collector
solution, to ease its use and understanding:
- KeywordSearch: extract the terms (and frequencies) found in a list of
fields
from the results of a query/filter search
-
I _think_ Lucene 2.1 (or is it trunk?, I lose track) has the ability
to delete all documents containing a term. So, every time you update
your profanity list, you could iterate over it and remove all
documents that contain the terms.
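The loop described above (for each listed term, delete every document containing it) can be sketched like this (toy in-memory index, not the Lucene delete-by-term API):

```python
# Toy document store: doc id -> set of terms in the document.
index = {
    1: {"quarterly", "report"},
    2: {"forbidden", "word", "here"},
    3: {"clean", "text"},
}

def delete_docs_containing(index, banned):
    # Drop every document that contains any banned term.
    return {doc_id: terms for doc_id, terms in index.items()
            if not (terms & banned)}

index = delete_docs_containing(index, {"forbidden"})
print(sorted(index))  # -> [1, 3]
```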
If a user can never get these documents via a query,
Hi,
I have to index many documents with the same fields (only one or two
fields are different). Can I add a field (Field instance) to many
documents? It seems to work, but I'm not sure if this is the right way...
Thank you
In general I would say this is not safe, because it seems to assume too
much about the implementation - and while it might in most cases currently
work, the implementation could change and the program assuming this would
stop working. It would most probably not work correctly right from the
start
[EMAIL PROTECTED] wrote on 08/03/2007 12:56:33:
I have to index many documents with the same fields (only one or two
fields are different). Can I add a field (Field instance) to many
documents? It seems to work, but I'm not sure if this is the right way...
What does "many" mean in this context?
: If you store a hash code of the word rather than the actual word you
: should be able to search for stuff but not be able to actually retrieve
that's a really great solution ... it could even be implemented as a
TokenFilter so none of your client code would ever even need to know that
it was
On 3/8/07, Chris Hostetter [EMAIL PROTECTED] wrote:
: If you store a hash code of the word rather than the actual word you
: should be able to search for stuff but not be able to actually retrieve
that's a really great solution ... it could even be implemented as a
TokenFilter so none of your
Hi Sachin, the link you gave me has only a zip file and an exe file for
download, and the zip file contains no class files. But wouldn't we need a
jar file or class file?
On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote:
Hi,
Here it is:
http://www.seekafile.org/
-Original
: I don't know... hashing individual words is an extremely weak form of
: security that should be breakable without even using a computer... all
: the statistical information is still there (somewhat like 'encrypting'
: a message as a cryptoquote).
:
: Doron's suggestion is preferable: eliminate
: Do I have this right? I got a bit confused at first because I assumed that
: the actual field values were being used in the computation, but you really
: need to know the unique term count in order to get the score 'right'.
you can use the actual values in FunctionQueries, except that:
1)
On 3/8/07, Chris Hostetter [EMAIL PROTECTED] wrote:
if the issue is that you want to be able to ship an index that people can
manipulate as much as they want and you want to guarantee they can never
reconstruct the original docs, you're pretty much screwed ... even if you
eliminate all of the
I think the API should allow for explicitly flushing the FieldCache.
I have a setup where new readers are being loaded every so often.
I don't want to rely on Java's WeakHashMap to free the cache; I
want to be able to do it in a deterministic way.
It would be great if this can be added to