indexing pdfs

2007-03-08 Thread ashwin kumar

Hi, can someone help me by giving some sample programs for indexing PDFs and
.doc files?

thanks
regards
ashwin


RE: indexing pdfs

2007-03-08 Thread Kainth, Sachin
Hi Ashwin,

You can try PDFBox to convert the PDF documents to text and then use
Lucene to index the text.  The code for turning a PDF into text is very
simple:

import java.io.IOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

private static String parseUsingPDFBox(String filename) throws IOException
{
    // load the document
    PDDocument doc = PDDocument.load(filename);
    try {
        // create stripper (wish I had the power to do that -
        // wouldn't leave the house)
        PDFTextStripper stripper = new PDFTextStripper();
        // get text from doc using stripper
        return stripper.getText(doc);
    } finally {
        // release the parsed document
        doc.close();
    }
}
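
To index the extracted text with Lucene, a minimal sketch along these
lines should work (the field names "path" and "contents" are just
examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

private static void indexPdf(IndexWriter writer, String filename) throws Exception
{
    // extract the text as above, then index it
    String text = parseUsingPDFBox(filename);
    Document doc = new Document();
    // keep the path stored so hits can be mapped back to files
    doc.add(new Field("path", filename, Field.Store.YES, Field.Index.UN_TOKENIZED));
    // the body is indexed for searching but not stored
    doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
}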

Sachin




Re: indexing pdfs

2007-03-08 Thread Ulf Dittmer
For DOC files you can use the Jakarta POI library. Text extraction is  
outlined here: http://jakarta.apache.org/poi/hwpf/quick-guide.html
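
A rough sketch of the extraction side, assuming a POI version that has the
HWPF WordExtractor (check the quick guide for the API your version
actually ships):

import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

// pull the plain text out of a .doc file so it can be handed to Lucene
FileInputStream fis = new FileInputStream("some.doc");
WordExtractor extractor = new WordExtractor(new HWPFDocument(fis));
String text = extractor.getText();
fis.close();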


Ulf

On 08.03.2007, at 10:37, ashwin kumar wrote:

hi can some one help me by giving any sample programs for indexing  
pdfs and .doc files






Re: indexing pdfs

2007-03-08 Thread ashwin kumar

Is the only way to index PDFs to convert them into text and then index
that?







RE: indexing pdfs

2007-03-08 Thread Kainth, Sachin
Well, you don't need to actually save the text to disk and then index the
saved text file; you can index that text directly in-memory.

The only other way I have heard of is to use IFilters.  I believe
SeekAFile does indexing of PDFs.

Sachin 




Re: indexing pdfs

2007-03-08 Thread ashwin kumar

Hi again. Do we have to download any jar files to run this program? If so,
can you give me the link, please?

ashwin





Re: Lucene Ranking/scoring

2007-03-08 Thread Peter Keegan

I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be
used to rank documents by score and date (solr.search.function contains
great stuff!). The values in the date field that are used for the
ValueSource are not actually used as 'floats', but rather as their ordinal
term values from the FieldCache string index. This means that if the 'date'
field has 3000 unique string 'values' in the index, the values for 'x' in
ReciprocalFloatFunction could be 0-2999. So if I want the most recent 'date'
to return a score of 1.0, one could set 'a' and 'b' in the function to
2999.
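
For reference, ReciprocalFloatFunction computes a/(m*x + b), so with m=1
and a=b=2999 over those 3000 unique values:

score(0)    = 2999 / (0 + 2999)    = 1.0   (most recent date)
score(2999) = 2999 / (2999 + 2999) = 0.5   (oldest date)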

Do I have this right? I got a bit confused at first because I assumed that
the actual field values were being used in the computation, but you really
need to know the unique term count in order to get the score 'right'.

By the way, as I try to get my head around the Score, Weight, and Boolean*
classes (and next(), skipTo()), I nominate these for discussion in Lucene In
Action II.

Peter

On 3/9/06, Yonik Seeley [EMAIL PROTECTED] wrote:


On 3/9/06, Yang Sun [EMAIL PROTECTED] wrote:
 Hi Yonik,
 Thanks very much for your suggestion. The query boost works great for
 keyword matching. But in my case, I need to rank the results by date and
 title. For example, title:foo^2 abstract:foo^1.5 date:2004^3 will only
 boost the document with date=2004. What I need is boosting by the distance
 from the specified date

If all you need to do is boost more recent documents (and a single
fixed boost will always work), then you can do that boosting at index
time.

 which means 2003 will have a better ranking than 2002, 2002 > 2001, etc.
 I implemented a customized ScoreDocComparator class which works fine for
 one field. But I met some trouble when trying to combine other fields
 together. I'm still looking at FunctionQuery. Don't know if I can figure
 out something.

FunctionQuery support is integrated into Solr (or currently hacked-in,
as the case may be),  and can be useful for debugging and trying out
query types even if you don't use it for your runtime.

ReciprocalFloatFunction might meet your needs for increasing the score
of more recent documents:

http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/ReciprocalFloatFunction.html

The SolrQueryParser can make
ReciprocalFloatFunction(new ReverseOrdFieldSource("my_date"),1,1000,1000)
out of _val_:recip(rord(my_date),1,1000,1000)

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search
Server





RE: indexing pdfs

2007-03-08 Thread Kainth, Sachin
Hi,

Here it is:

http://www.seekafile.org/ 




Index a source, but not store it... can it be done?

2007-03-08 Thread Walt Stoneburner

Have an interesting scenario I'd like to get your take on with respect
to Lucene:

A data provider (e.g. someone with a private website or corporately
shared directory of proprietary documents) has requested their content
be indexed with Lucene so employees can be redirected to it, but
provisionally -- under no circumstance should that content be stored
or recreated from the index.

Is that even possible?

The data owner's request makes sense in the context of them wanting to
retain full access control via logins as well as collecting access
metrics.

If the token 'CAT' points to C:\Corporate\animals.doc and the token
'DOG' also points there, then great, CAT AND DOG will give that
document a higher rating, though it is not possible to reconstruct
(with any great accuracy) what the actual document content is.

However, if for the sake of using the NEAR operator with Lucene the
tokens are stored as  LET'S:1 SELL:2 CAT:3 AND:4 DOG:5 ROBOT:6 TOYS:7
THIS:8 DECEMBER:9 ... then someone could pull all tokens for
animal.doc and reconstitute the token stream.

Does Lucene have any kind of trade-off for working with "secure" (and
I use this term loosely) data?

-wls




Lucene 2.1, inconsistent phrase query results with slop

2007-03-08 Thread Erick Erickson

In a nutshell, reversing the order of the terms in a phrase query can
result in different hit counts. That is, "person place"~3 may return
different results from "place person"~3, depending on the number
of intervening terms.


There's a self-contained program below that
illustrates what I'm seeing, along with output.

SpanNear does not exhibit this behavior, so I can make things work.

I didn't find anything in my (admittedly brief) search of the archives
or the open issues that directly spoke to this.

Several questions:

1> is this a bug or not?

2> is anyone working on it or should I dig into it? It looks like
it may be related to LUCENE-736.

3> does the phrase from LIA (pg 208) "Given enough slop, PhraseQuery
will match terms out of order in the original text." apply here?

4> Do you want me to post this on the developers list (I can hear it
now... not unless you also post a patch too <G>)

Thanks
Erick


import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;


public class PhraseProblem
{
    public static void main(String[] args)
    {
        try {
            PhraseProblem pp = new PhraseProblem();

            pp.tryIt();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


    private void tryIt() throws Exception
    {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        Document doc = new Document();

        doc.add(
            new Field(
                "field",
                "person space space space place",
                Field.Store.YES,
                Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);

        System.out.println("trying phrase queries");
        this.trySlop(searcher, 2);
        this.trySlop(searcher, 3); //FAILS
        this.trySlop(searcher, 4); //FAILS
        this.trySlop(searcher, 5);
        this.trySlop(searcher, 6);
        this.trySlop(searcher, 7);

        System.out.println("trying SpanNear queries");
        this.trySpan(searcher, 2);
        this.trySpan(searcher, 3);
        this.trySpan(searcher, 4);
        this.trySpan(searcher, 5);
        this.trySpan(searcher, 6);
        this.trySpan(searcher, 7);
    }

    private void trySpan(IndexSearcher searcher, int slop) throws Exception
    {
        SpanQuery sq1 = new SpanTermQuery(new Term("field", "person"));
        SpanQuery sq2 = new SpanTermQuery(new Term("field", "place"));
        SpanNearQuery sqn1 = new SpanNearQuery(
            new SpanQuery[] {sq1, sq2}, slop, false);

        SpanNearQuery sqn2 = new SpanNearQuery(
            new SpanQuery[] {sq2, sq1}, slop, false);

        Hits hits1 = searcher.search(sqn1);
        Hits hits2 = searcher.search(sqn2);

        this.printResults(hits1, hits2, slop);
    }

    private void trySlop(IndexSearcher searcher, int slop)
        throws Exception
    {
        QueryParser qp = new QueryParser("field", new WhitespaceAnalyzer());
        Query query1 = qp.parse(String.format("\"person place\"~%d", slop));
        Query query2 = qp.parse(String.format("\"place person\"~%d", slop));

        Hits hits1 = searcher.search(query1);
        Hits hits2 = searcher.search(query2);
        this.printResults(hits1, hits2, slop);
    }

    private void printResults(Hits hits1, Hits hits2, int slop) {
        if (hits1.length() != hits2.length()) {
            System.out.println(
                String.format(
                    "Unequal hit counts. hits1.length %d, hits2.length %d slop : %d",
                    hits1.length(),
                    hits2.length(),
                    slop));
        } else {
            System.out.println(
                String.format(
                    "Found identical hit counts of %d, slop: %d",
                    hits1.length(),
                    slop));
        }
    }
}

output

trying phrase queries

Found identical hit counts of 0, slop: 2
Unequal hit counts. hits1.length 1, hits2.length 0 slop : 3
Unequal hit counts. hits1.length 1, hits2.length 0 slop : 4
Found identical hit counts of 1, slop: 5
Found identical hit counts of 1, slop: 6
Found identical hit counts of 1, slop: 7


trying SpanNear queries

Found identical hit counts of 0, slop: 2
Found identical hit counts of 1, 

Re: Lucene 2.1, inconsistent phrase query results with slop

2007-03-08 Thread Yonik Seeley

On 3/8/07, Erick Erickson [EMAIL PROTECTED] wrote:

In a nutshell, reversing the order of the terms in a phrase query can
result in different hit counts. That is, person place~3 may return
different results from place person~3, depending on the number
of intervening terms.


I think that's working as designed.   Although I could understand
someone wanting it to work differently.  The slop is sort of like the
edit distance from the current given phrase, hence the order of terms
in the phrase matters.

-Yonik




Re: Lucene 2.1, inconsistent phrase query results with slop

2007-03-08 Thread Chris Hostetter

: I think that's working as designed.   Although I could understand
: someone wanting it to work differently.  The slop is sort of like the
: edit distance from the current given phrase, hence the order of terms
: in the phrase matters.

correct ... LIA has a great diagram explaining this ... the slop refers to
how many positions you have to move the terms in the PhraseQuery to match.



-Hoss





Multiple segments

2007-03-08 Thread Kainth, Sachin
Hi all,

I have been performing some tests on index segments and have a problem.
I have read the file formats document on the official website and from
what I can see it should be possible to create as many segments for an
index as there are documents (though of course this is not a great
idea).  Having searched around, it occurred to me that the way to do this
is to set maxMergeDocs to 1.  Having tried this I found that it doesn't
work.  All documents still get put into one segment.  Any idea what I
should do?

Thanks




Plural word search

2007-03-08 Thread Tony Qian

All,

I'm evaluating Lucene as a full-text search engine for a project. One of
the requirements is the following:


4) Plural Literal Search
If you use the plural of a term such as "bears", the results will include
matches to the plural term "bears" as well as the singular term "bear".


It seems to me we need to build a dictionary to support it. Does Lucene
support it?


appreciate your help.

Tony




Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Doron Cohen
Token positions are used also for phrase search.

You could probably compromise on this by setting all token positions to 0 -
this would make a document appear to be a *set* of words (rather than a
*list*). An adversary would be able to know/guess what words are in each
document (and, with (API) access to the index itself, how many times each
word appears in each document), but would not be able to reconstruct a
good approximation of that document, because term positions are all 0. If
this is sufficient, I think you can do it by writing your own Analyzer with
a TokenFilter that takes care of the position - see
Token.setPositionIncrement().
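
A rough sketch of such a filter against the 2.x TokenStream API (untested;
the class name is made up):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// puts every token at the same position, so the index records which
// words a document contains but not their order
public class FlatPositionFilter extends TokenFilter {
    private boolean first = true;

    public FlatPositionFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t != null) {
            // the first token keeps increment 1, every later token
            // stacks on the same position
            t.setPositionIncrement(first ? 1 : 0);
            first = false;
        }
        return t;
    }
}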

Hope this helps,
Doron




RE: Plural word search

2007-03-08 Thread Kainth, Sachin
Hi Tony,

Lucene certainly does support it.  It just requires you to use a
tokeniser that performs stemming, such as any analyzer that uses
PorterStemFilter.
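
For example, a minimal stemming analyzer might look like this (a sketch;
pick whatever tokenizer fits your content):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // "bears" and "bear" both come out as the same stem
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}

Use the same analyzer at index time and at query time so both sides stem
the same way.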

Sachin 




RE: Plural word search

2007-03-08 Thread Tony Qian


Sachin,

Thanks for the quick response. Is there any code example I can take a look
at? I'm not familiar with the technique you mentioned. My question is how
the analyzer knows "buss" is not a plural and "bears" is a plural.


Lucene supports wildcards. However, we cannot use a wildcard at the beginning
of a search term, such as *bear. Is there a way to match *bear* (bear, bears,
forbearance, etc.) with the search term "bear"?


thanks





Re: Multiple segments

2007-03-08 Thread Doron Cohen
maxMergeDocs only limits the merging of already saved segments as a result
of calling addDocument(). If there are added documents not yet saved but
rather still buffered in memory (by IndexWriter), once their number exceeds
maxBufferedDocs they are saved, but as a single merged segment. So you
could set maxBufferedDocs to 2 (that's the minimal value) and maxMergeDocs
to 1 and add N documents to the index - that would likely result in N/2
segments. You could probably force N segments by closing the index after
each add and reopening it before the next add. Note that while such settings
might be interesting for learning purposes, they would have an unpleasant
performance impact... Lastly, when calling optimize(), no matter what the
above settings are, a single segment is created.
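
For example, a sketch against the 2.1 API:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/tmp/testindex", new StandardAnalyzer(), true);
writer.setMaxBufferedDocs(2);  // flush buffered docs as a segment every 2 adds
writer.setMaxMergeDocs(1);     // keep already-saved segments from being merged
// ... addDocument() calls ...
writer.close();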

Regards,
Doron




Re: Term Frequency within Hits

2007-03-08 Thread Chiradeep Vittal
Term Frequency in Lucene parlance = the number of occurrences of the term
within a single document.
If you're looking for how many documents have term x, where x is unknown, see
SimpleFacets in Solr:
http://lucene.apache.org/solr/api/org/apache/solr/request/SimpleFacets.html
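
For the per-document counts, a sketch of reading them back (this assumes
the field was indexed with term vectors enabled; "contents" and docId are
just placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

IndexReader reader = IndexReader.open("/path/to/index");
TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
String[] terms = tfv.getTerms();
int[] freqs = tfv.getTermFrequencies();  // freqs[i] = count of terms[i] in docId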


- Original Message 
From: Erick Erickson [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, March 7, 2007 2:29:14 PM
Subject: Re: Term Frequency within Hits

See TermFreqVector, HitCollector, perhaps TopDocs, perhaps
TermEnum. Make sure you create your index such that frequencies
are stored (see the FAQ).

Erick

On 3/7/07, teramera [EMAIL PROTECTED] wrote:


 So after I execute a search I end up with a 'Hits' object. The number of
 Hits is on the order of a million.
 What I want to do from these Hits is extract term frequencies for a few
 known fields. I don't have a global list of terms for any of the fields but
 want to generate the term frequency based on terms from the Hits.

 Iterating over the hits and doing this later is of course turning out to be
 very expensive.
 Is there a known Lucene way of solving such a problem so that this
 calculation happens as the hits are being accumulated?
 Appreciate any pointers,










RE: Plural word search

2007-03-08 Thread Chris Hostetter

: Thanks for quick response. Is there any code example i can take look? I'm
: not familiar with the technique you mentioned. My question is how the
: analyzer knows buss is not a plural and bears is a plural.

Stemming is a vast topic of text analysis .. some stemmers work using
dictionaries, some are based on algorithmic approaches ... almost any stemmer
you can imagine can be implemented as a TokenFilter in Lucene -- and a few
already are, out of the box.

You might want to read up a little bit on the different stemming
approaches out there (google: stemming) and then take a look at some of
the Lucene analysis classes that provide implementations.





-Hoss





Re: Lucene 2.1, inconsistent phrase query results with slop

2007-03-08 Thread Erick Erickson

Sorry about that. I think I found the diagram you're talking about on page
89. It even addresses the exact problem I'm talking about.

It's not the first time I've looked like a fool; you'd think I'd be getting
used to it by now <G>.

So, it seems like the most reasonable solution to this issue would
be for me to re-write the phrase queries as SpanNear queries, no?

Erick




Re: Plural word search

2007-03-08 Thread Erick Erickson

As of 2.1, as I remember, you can use leading wildcards but ONLY if
you set a flag (see setAllowLeadingWildcard in QueryParser). Be
aware of the TooManyClauses issue though (search the mail
archive and you'll find many discussions of this issue).
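
For example (a sketch; the field name is just a placeholder):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
qp.setAllowLeadingWildcard(true);  // leading wildcards are off by default
Query q = qp.parse("*bear*");      // may expand to very many terms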

Erick





Re: Plural word search

2007-03-08 Thread Tony Qian


Erick,

thanks for the information.

Tony





Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Jason Pump
If you store a hash code of the word rather than the actual word, you
should be able to search for stuff but not be able to actually retrieve
it; you can trade precision for security based on the number of bits
in the hash code (e.g. 32 or 64 bits). I'd think a 64-bit hash would be
a reasonable midpoint.


hash64("dog") = 4312311231123121;

body:4312311231123121 returns the document with "dog", but also any other
document with a word that hashes to the same value.
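
A sketch of how that could be wired in at analysis time; hash64 here is
just a stand-in for whatever 64-bit hash you pick (FNV-1a below):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// replaces each token's text with a 64-bit hash, so the index holds
// hashes instead of words; must be used at index AND query time
public class HashingFilter extends TokenFilter {
    public HashingFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        Token out = new Token(Long.toString(hash64(t.termText())),
                              t.startOffset(), t.endOffset());
        out.setPositionIncrement(t.getPositionIncrement());
        return out;
    }

    // FNV-1a, 64-bit
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}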






Re: A solution to HitCollector-based searches problems

2007-03-08 Thread oramas martín

Hello,

I have just added some search implementation samples based on this collector
solution, to ease the use and understanding of it:

   - KeywordSearch: Extract the terms (and frequency) found in a list of
     fields from the results of a query/filter search

   - GoogleSearch: Return an ordered search result grouped a la Google,
     based on the terms found in a list of fields

   - GetFieldNamesOp: Operation to mimic the getFieldNames method of
     IndexReader but using a searcher. With it, it is possible
     to explore the fields of remote indexes.

See http://sourceforge.net/projects/lucollector/ for the source code (
lu-collector-src-sampleop-0.8.zip).

Regards,
José L. Oramas

On 2/26/07, oramas martín [EMAIL PROTECTED] wrote:



Hello,

As you probably know, the HitCollector-based search API is not meant to
work remotely, because it will generate an RPC callback for every non-zero
score.

There is another problem with the MultiSearcher HitCollector-based search,
which knows nothing about mixing HitCollector-based searches (not to
mention it hardcodes the way TopDocs are mixed for the score and for the
Sort searches). Also, ParallelMultiSearcher inherits these problems and is
unable to parallelize the HitCollector-based searches.

A final problem with the HitCollector-based search is related to the loss
of a limit on the results, which the Hits class implements through the
getMoreDocs() function, and to the lazy loading and caching of documents
it does.


To solve those problems a factory (HitCollectorSource) is necessary, one
able to generate collectors for single (SingleHitCollector) and multi
(MultiHitCollector) searches, plus a new search method in the Searchable
interface for it. To avoid modifications to the Lucene core, the latter
requirement is NOT IMPLEMENTED in the library I have just created.
Instead, an ugly solution is provided: a wrapper for those searchers
(SearcherHCSourceWrapper) and a Filter wrapper (FilterHitCollectorSource)
to carry the factory-based searches.

Each collector is based on a two-step sequence, one for collecting hits
or subsearcher results, and another for generating the final result.

Also, just in case you don't want to add a wrapper to each searcher of
your project, there is an adapted version of IndexSearcher, MultiSearcher
and ParallelMultiSearcher (only for version 2.1) modified exactly the same
way the wrapper class SearcherHCSourceWrapper does. Just put them in your
class-path (before the Lucene core jar) and you will be using the new
collector interfaces without modifying your code.

There are some unit tests (copied and adapted from the Lucene 2.1
distribution).

See http://sourceforge.net/projects/lucollector/ for the jar files and the
code.

If you find it interesting to complement the Lucene project, tell me how
to put it in the contribution area.

Regards,
José L. Oramas



Re: Negative Filtering (such as for profanity)

2007-03-08 Thread Grant Ingersoll
I _think_ Lucene 2.1 (or is it trunk? I lose track) has the ability
to delete all documents containing a term.  So, every time you update
your profanity list, you could iterate over it and remove all
documents that contain the terms.
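
Something along these lines (a sketch; the field name and the
profanityList array are just placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// drop every document that contains any word on the profanity list
IndexReader reader = IndexReader.open("/path/to/index");
for (int i = 0; i < profanityList.length; i++) {
    reader.deleteDocuments(new Term("contents", profanityList[i]));
}
reader.close();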


If a user can never get these documents via a query, then I don't see  
any reason to allow them in the index to begin with.


Also, I don't use QueryFilters much, but  I'm curious as to how they  
perform on that many docs.



On Mar 7, 2007, at 5:38 PM, Greg Gershman wrote:

I thought about this, as I think overall the resources required  
would be less than creating a filter.  Ultimately I decided against  
it for a few reasons:
1) I'm working with an existing index of ~50 million documents, I  
don't want to reindex the whole thing, or even just the documents  
that contain profanity, if I can avoid it.
2) Filtering at indexing time means I can't effectively add new  
words to the profanity list without reindexing.


Good suggestion, though, I appreciate it.

Greg

- Original Message 
From: Grant Ingersoll [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, March 7, 2007 2:07:38 PM
Subject: Re: Negative Filtering (such as for profanity)

Not sure if this is helpful given your proposed solution, but could you
do something on the indexing side, such as:

1. Remove the profanity from the token stream, much like a
stopword.  This would also mean stripping it from the display text.
2. If your TokenFilter comes across a profanity, somehow mark the
document as containing a profanity via a "profanity" Field (not sure
if there is a way, in Lucene, to add another Field while you are in
the analysis phase, but you could also have it update a table in a db
or something.)  Then on search, you could just say (regular query)
+profanity:false
HTH,
Grant

On Mar 7, 2007, at 10:07 AM, Greg Gershman wrote:


I'm attempting to create a profanity filter.  I thought to use a
QueryFilter created with a Query of (-$#!+ AND [EMAIL PROTECTED] AND etc).  The
problem I have run into is that, as a pure negative query is not
supported (a query for (-term) DOES NOT return the inverse of a
query for (term)), I believe the bit set returned by a purely
negative QueryFilter is empty, so no matter how many results are
returned by the initial query, the result after filtering is always
zero documents.

I was wondering if anyone had suggestions as to how else to do
this.  I've considered simply amending the query string submitted
by the user to include a pre-generated String that would exclude
the query terms, but I consider this a non-elegant solution.  I had
also thought about creating a new sub-class of QueryFilter,
NegativeQueryFilter.  Basically, it would work just like a
QueryFilter, taking a positive query (so, I would pass it an OR'ed
list of profane words), then the resulting bits are simply
flipped.  I think this would work, unless I'm missing something.
I'm going to experiment with it, I'd appreciate anyone's thoughts
on this.
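
A rough sketch of what I mean (untested, against the 2.x Filter API):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;

// matches every document NOT matched by the positive query
public class NegativeQueryFilter extends Filter {
    private final QueryFilter inner;

    public NegativeQueryFilter(Query positiveQuery) {
        inner = new QueryFilter(positiveQuery);
    }

    public BitSet bits(IndexReader reader) throws IOException {
        // clone so we don't flip QueryFilter's cached bits in place
        BitSet bits = (BitSet) inner.bits(reader).clone();
        bits.flip(0, reader.maxDoc());
        return bits;
    }
}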

Thanks,

Greg







--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ















--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/






one Field in many documents

2007-03-08 Thread new333333
Hi,

I have to index many documents with the same fields (only one or two
fields are different). Can I add a field (Field instance) to many
documents? It seems to work but I'm not sure if this is the right way...

Thank you





Re: one Field in many documents

2007-03-08 Thread Doron Cohen
In general I would say this is not safe, because it seems to assume too
much about the implementation - and while it might in most cases currently
work, the implementation could change and the program assuming this would
stop working. It would most probably not work correctly right from the
start for fields constructed with a Reader.

Regards,
Doron

[EMAIL PROTECTED] wrote on 08/03/2007 12:56:33:

 Hi,

 I have to index many documents with the same fields (only one or two
 fields are different). Can I add a field (Field instance) to many
 documents? It seams to work but I'm not sure if this is the right way...

 Thank you





Re: one Field in many documents

2007-03-08 Thread Michael D. Curtin

[EMAIL PROTECTED] wrote on 08/03/2007 12:56:33:

I have to index many documents with the same fields (only one or two
fields are different). Can I add a field (Field instance) to many
documents? It seams to work but I'm not sure if this is the right way...

What does "many" mean in this context?  If it means "most" or "all",
perhaps it would be better not to index those fields at all -- they
would be adding little or nothing in terms of information content.


--MDC




Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Chris Hostetter
: If you store a hash code of the word rather then the actual word you
: should be able to search for stuff but not be able to actually retrieve

that's a really great solution ... it could even be implemented as a
TokenFilter so none of your client code would ever even need to know that
it was being used (just make sure it comes last, after any stemming or
what not)




-Hoss





Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Mike Klaas

On 3/8/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: If you store a hash code of the word rather then the actual word you
: should be able to search for stuff but not be able to actually retrieve

that's a really great solution ... it could even be implemented asa
TokenFilter so none of your client code would ever even need to know that
it was being used (just make sure it comes last after any stemming or what
not)


I don't know... hashing individual words is an extremely weak form of
security that should be breakable without even using a computer... all
the statistical information is still there (somewhat like 'encrypting'
a message as a cryptoquote).

Doron's suggestion is preferable: eliminate token position information
from the index entirely.

-Mike




Re: indexing pdfs

2007-03-08 Thread ashwin kumar

Hi Sachin, the link you gave me has only a zip file and an exe file for
download, and this zip file also contains no class files. But wouldn't we
be requiring a jar file or class file?




Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Chris Hostetter

: I don't know... hashing individual words is an extremely weak form of
: security that should be breakable without even using a computer... all
: the statistical information is still there (somewhat like 'encrypting'
: a message as a cryptoquote).
:
: Doron's suggestion is preferable: eliminate token position information
: from the index entirely.

i guess i wasn't thinking about this as a security issue, more a
discouragement issue ... reconstructing a doc from term vectors is easy,
reconstructing it from just term positions is harder but not impossible,
reconstructing from hashed tokens requires a lot of hard work.

if the issue is that you want to be able to ship an index that people can
manipulate as much as they want and you want to guarantee they can never
reconstruct the original docs you're pretty much screwed ... even if you
eliminate all of the position info, statistical info about language
structure can help you glean a lot about the source data.

i'm no crypto expert, but i imagine it would probably take the same
amount of statistical guesswork to reconstruct meaningful info from
either approach (hashing the individual words compared to eliminating the
positions) so i would think the trade off of supporting phrase queries
would make the hashing approach more worthwhile.

i mean after all: you still want the index to be useful for searching
right? ... if you are really paranoid don't just strip the positions,
strip all duplicate terms as well to prevent any attempt at statistical
sampling ... but now all you really have is a lookup table of word to
docid with no tf/idf or position info to improve scoring, so why bother
with Lucene, just use a BerkeleyDB file to do your lookups.


-Hoss





Re: Lucene Ranking/scoring

2007-03-08 Thread Chris Hostetter

: Do I have this right? I got bit confused at first because I assumed that the
: actual field values were being used in the computation, but you really need
: to know the unique term count in order to get the score 'right'.

you can use the actual values in FunctionQueries, except that:
  1) dates aren't numeric values that lend themselves well to functions
  2) the ReverseOrdinalValueSource comes in handy when you want the docs
with the highest value (ie: most recent date) to be special (ie: to plug
into your reciprocal function and get the max value).

i suppose you could write a ValueSource that finds the max value of a
field and then a ValueSource that normalizes all the values of one
valuesource against the value(s) of another value source ... but no one
has done that yet (and it still wouldn't have a lot of meaning for dates)


-Hoss





Re: Index a source, but not store it... can it be done?

2007-03-08 Thread Mike Klaas

On 3/8/07, Chris Hostetter [EMAIL PROTECTED] wrote:


if the issue is that you want to be able to ship an index that people can
manipulate as much as they want and you want to guarantee they can never
reconstruct the original docs you're pretty much screwed ... even if you
eliminate all of the position info, statistical info about language
structure can help you glean a lot about the source data.


True.


i'm no crypto expert, but i imagine it would probably take the same
amount of statistical guesswork to reconstruct meaningful info from
either approach (hashing the individual words compared to eliminating the
positions) so i would think the trade off of supporting phrase queries
would make the hashing approach more worthwhile.


I suppose it also depends on how much access the user has to the
index.  If they have access to the physical index and means of
querying it, then they have access to the hashing algo (and/or key)
and so it is worthless.  If they don't, and their access is strictly
through queries, then I don't see what help hashing will provide, as
the result of any given query should be the same, hashing or not.


i mean after all: you still want the index to be useful for searching
right? ... if you are really paranoid don't just strip the positions,
strip all duplicate terms as well to prevent any attempt at statistical
sampling ... but now all you really have is a lookup table of word to
docid with no tf/idf or position info to improve scoring, so why bother
with Lucene, just use a BerkeleyDB file to do your lookups.


You could also do both.  Another thing that might help is relatively
aggressive stop word removal.  All these measures will raise the
discouragement bar slightly.

-Mike




FieldCache: flush cache explicitly

2007-03-08 Thread John Wang

I think the API should allow for explicitly flushing the FieldCache.

I have a setup where new readers are being loaded every so often. I don't
want to rely on Java's WeakHashMap to free the cache; I want to be able to
do it in a deterministic way.

It would be great if this can be added to Lucene, I can create a bug
if the Lucene gods agree to it :)

Thanks

-John
