Re: How to get hits coordinates in Lucene 4.4.0

2013-08-18 Thread Karl Wettin
On Aug 13, 2013, at 12:55 PM, Michael McCandless wrote: I'm less familiar with the older highlighters but likely it's possible to get the absolute offsets from them as well. Using vector highlighter I've achieved that by extending and cloning the code of

A couple of thoughts on non technical users and query parsers.

2013-05-30 Thread Karl Wettin
Non-technical users understand what a field is. Not all of them may know that they can use fields, but it's easy for them to learn that name:john will search for john only in names. Non-technical users can learn to understand that logic and functionality can be specified in their

Re: Blåbærsyltetøy v.s. Räksmörgås

2013-05-23 Thread Karl Wettin
22 maj 2013 kl. 20:29 skrev Petite Abeille: On May 22, 2013, at 7:08 PM, Karl Wettin karl.wet...@kodapan.se wrote: * Use a filter after ASCIIFoldingFilter that collapses ae, oe, oo, and other combinations of double vowels, keeping just the first vowel. I ended up

Blåbærsyltetøy v.s. Räksmörgås

2013-05-22 Thread Karl Wettin
This is a question (or perhaps a line of thought) regarding the mutually intelligible Scandinavian languages Danish, Norwegian and Swedish. The Swedish letters åäö are in fact the same letters as the Danish/Norwegian åæø. A Norwegian writing about the Swedish city of Göteborg writes Gøteborg and

Re: Blåbærsyltetøy v.s. Räksmörgås

2013-05-22 Thread Karl Wettin
22 maj 2013 kl. 14:37 skrev Karl Wettin: * Use a filter after ASCIIFoldingFilter that collapses ae, oe, oo, and other combinations of double vowels, keeping just the first vowel. I ended up with that solution. https://issues.apache.org/jira/browse/LUCENE-5013
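The folding idea described above can be sketched in plain Java without any Lucene classes. This is only an illustration under stated assumptions: the class and method names are hypothetical, the Scandinavian letters are mapped directly to a or o (instead of going through ASCIIFoldingFilter first), and the set of collapsed vowel pairs (aa, ae, ao, oe, oo) is a guess at what "other combinations" might cover. It is not the code attached to LUCENE-5013.

```java
import java.util.Locale;

class ScandinavianFoldingSketch {

    // Step 1 (assumption): map a-ring, a-umlaut and the ae-ligature to 'a',
    // and o-umlaut and o-slash to 'o'.
    static char foldChar(char c) {
        switch (c) {
            case '\u00e5': case '\u00e4': case '\u00e6': return 'a';
            case '\u00f6': case '\u00f8': return 'o';
            default: return c;
        }
    }

    // Step 2: collapse aa, ae, ao, oe and oo, keeping only the first vowel.
    static String fold(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        StringBuilder folded = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            folded.append(foldChar(lower.charAt(i)));
        }
        StringBuilder out = new StringBuilder(folded.length());
        for (int i = 0; i < folded.length(); i++) {
            char c = folded.charAt(i);
            out.append(c);
            if (i + 1 < folded.length()) {
                char next = folded.charAt(i + 1);
                boolean pair = (c == 'a' && (next == 'a' || next == 'e' || next == 'o'))
                        || (c == 'o' && (next == 'e' || next == 'o'));
                if (pair) {
                    i++; // skip the second vowel of the pair
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Both spellings of the same word should end up as the same token.
        System.out.println(fold("R\u00e4ksm\u00f6rg\u00e5s"));
        System.out.println(fold("raeksmoergaas"));
    }
}
```

The price of this kind of folding is that unrelated words containing these vowel pairs are altered as well, so it is best applied to a dedicated search field at both index and query time.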

Re: Best practices in boosting by proximity?

2013-05-04 Thread Karl Wettin
The simplest solution is to use slop in PhraseQuery, SpanNearQuery, etc(?). Also consider permutations of #isInOrder() with alternative query boosts. Even though slop will create a greater score the closer the terms are, it might still in some cases (usually when combined with other
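Why closer terms score higher under slop can be shown with a small sketch. It assumes the 1/(distance + 1) weighting that the default similarity of that Lucene generation applied to sloppy matches; the class and method names here are hypothetical and no Lucene classes are used.

```java
class SloppyFreqSketch {

    // Assumed weighting: an exact phrase match (distance 0) counts fully,
    // and each additional position of distance lowers the contribution.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        for (int distance = 0; distance <= 4; distance++) {
            System.out.println("distance " + distance + " weight " + sloppyFreq(distance));
        }
    }
}
```

Because the weight falls off quickly, a match near the edge of a large slop adds very little to the score, while the large slop still makes the query more expensive to evaluate.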

Re: Best practices in boosting by proximity?

2013-05-04 Thread Karl Wettin
something like your proximity query~20, but consider the cost of a great slop. 4 maj 2013 kl. 20:41 skrev Karl Wettin: The simplest solution is to use slop in PhraseQuery, SpanNearQuery, etc(?). Also consider permutations of #isInOrder() with alternative query boosts. Even though

Re: Reg Lucene Naive Bayesian classifier.

2013-01-14 Thread Karl Wettin
14 jan 2013 kl. 14:53 skrev VIGNESH S: Anyone Used the Naive Bayesian Classifier? It will be really helpful if some one Can post how to use the classifiers in Lucene .. Hi there, I posted a NB classifier in the jira back in 2007 that uses Lucene as a data matrix. It probably needs a bit of

Re: SSD Experience

2011-08-22 Thread Karl Wettin
22 aug 2011 kl. 18.49 skrev Rich Cariens: I found a Lucene SSD performance benchmark doc http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf but the wiki engine is refusing to let me view the attachment (I get You are not allowed to do

Re: negative wildcard query

2011-06-29 Thread Karl Wettin
You'll also need things to exclude from, eg a MatchAllDocsQuery. karl 29 jun 2011 kl. 17.25 skrev Clemens Wyss: Say I have a document with field f1. How can I search Documents which have not test in field f I tried: -f: *test* f: -*test* f: NOT *test* but no luck. Using
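The point about needing something to exclude from can be illustrated with document-id sets. In this sketch java.util.BitSet stands in for the set of matching documents; the class and method names are hypothetical and nothing here is Lucene API. A purely negative clause has no starting set, so the sketch begins with "all documents" (the role MatchAllDocsQuery plays) and then removes the matches.

```java
import java.util.BitSet;

class NegativeQuerySketch {

    // Start from every document id in [0, maxDoc) and remove the ones
    // that match the clause we want to negate.
    static BitSet exclude(int maxDoc, BitSet matching) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);    // the "match all documents" part
        result.andNot(matching);  // the "NOT *test*" part
        return result;
    }

    public static void main(String[] args) {
        BitSet matchesTest = new BitSet();
        matchesTest.set(1);
        matchesTest.set(3);
        System.out.println(exclude(5, matchesTest));
    }
}
```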

Re: Lemmatization

2011-06-08 Thread Karl Wettin
Perhaps least frequent substring or even suffix truncation might be enough for your needs. Here is a related paper: http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf karl On Jun 8, 2011, at 1:52 PM, Mohamed Yahya wrote: You're right. Still, I am not sure if there is a library that would

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-19 Thread Karl Wettin
On Jan 18, 2011, at 10:04 PM, Grant Ingersoll wrote: As devs of Lucene/Solr, due to the way ASF mirrors, etc. works, we really don't have a good sense of how people get Lucene and Solr for use in their application. Because of this, there has been some talk of dropping Maven support for

Re: [SOLR] DisMaxQParserPlugin and Tokenization

2010-11-23 Thread Karl Wettin
22 nov 2010 kl. 10.56 skrev jan.kure...@nokia.com jan.kure...@nokia.com: Using the SearchHandler with the deftype=”dismax” option enables the DisMaxQParserPlugin. From investigating it seems, it is just tokenizing by whitespace. Although by looking in the code I could not find the place,

Re: Fuzzy Phrase

2010-09-27 Thread Karl Wettin
There is a SpanFuzzyQuery for Lucene 1.9 from 2006 in LUCENE-522. karl 27 sep 2010 kl. 00.19 skrev Fabiano Nunes: Thank you, Schindler. When combining queries, I need two strings, one for each field. I want to use just one string like -- head:hello~ world~3 AND contents:colorless~

Re: instantiated contrib

2010-08-27 Thread Karl Wettin
is a litter term is very frequent and other term is very rare. 2010/8/27 Karl Wettin karl.wet...@gmail.com: My mail client died while sending this mail.. Sorry for any duplicate. It is strange that it should take 20 seconds to gather fields, this is the only thing that really surprises me. I'd

Re: instantiated contrib

2010-08-26 Thread Karl Wettin
My mail client died while sending this mail.. Sorry for any duplicate. It is strange that it should take 20 seconds to gather fields, this is the only thing that really surprises me. I'd expect it to be instant compared to RAMDirectory. It is hard to say from the information you provided.

Re: Hot to get word importance in lucene index

2010-07-23 Thread Karl Wettin
Hi, Please define important. Important to do what? It would probably be helpful if you explained what it is you attempt to achieve by doing this. Perhaps there is something in MoreLikeThis that will help you? karl 23 jul 2010 kl. 04.44 skrev Xaida: Hi all! hmmm, i need to

Re: Reverse Lucene queries

2010-07-23 Thread Karl Wettin
23 jul 2010 kl. 08.30 skrev sk...@sloan.mit.edu: Hi all, I have an interesting problem...instead of going from a query to a document collection, is it possible to come up with the best fit query for a given document collection (results)? Best fit being a query which maximizes the hit scores of

Re: Hot to get word importance in lucene index

2010-07-23 Thread Karl Wettin
Are you perhaps looking for this: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/similar/MoreLikeThis.html ? karl 23 jul 2010 kl. 10.54 skrev Xaida: Hi! thanks for reply! I will try to explain better, sorry if it was unclear. I have user text document

Re: about contrib instantiated

2010-07-03 Thread Karl Wettin
2 jul 2010 kl. 08.32 skrev Li Li: I have an index of about 8,000,000 documents and the current index size is about 30GB. Is it possible to use this contrib to speed up my search? I have enough memory for it. In order to answer your question you'll need to benchmark using a lot of typical

Re: Lucene Partition Size

2010-04-09 Thread Karl Wettin
NFS to EMC Celera devices. (NFS 3) - The drives are 300 gb fiber attached with 10,000 rpm. Thanks, Ivan --- On Thu, 4/8/10, Karl Wettin karl.wet...@gmail.com wrote: From: Karl Wettin karl.wet...@gmail.com Subject: Re: Lucene Partition Size To: java-user@lucene.apache.org Date: Thursday, April 8

Re: Lucene Partition Size

2010-04-08 Thread Karl Wettin
8 apr 2010 kl. 20.05 skrev Ivan Provalov: We are using Lucene for searching of 200+ mln documents (periodical publications). Is there any limitation on the size of the Lucene index (file size, number of docs, etc...)? The only such limitation in Lucene I'm aware of is Integer.MAX_VALUE

Re: query: order of search

2010-04-01 Thread Karl Wettin
1 apr 2010 kl. 11.21 skrev suman.hol...@zapak.co.in suman.hol...@zapak.co.in : its written to do a search within search, so that the second search is constrained by the results of the first query If I understand your needs you could while collecting search results populate a new filter

Re: InstantiatedIndex performance

2010-03-31 Thread Karl Wettin
31 mar 2010 kl. 10.21 skrev Michael Stoppelman: I was wondering why the InstantiatedIndex gets very slow as the number of documents increases in the index. I've been looking at the source and have only found comments saying it's slow when the index is big but not why. Do folks just run

Re: Lucene as a primary datastore

2010-01-20 Thread Karl Wettin
20 jan 2010 kl. 04.58 skrev Guido Bartolucci: Am I just ignorant and scared of Lucene and too trusting of Oracle and MySQL? Since all your comparisons are with relational databases, I feel obligated to say what has been said so many times on this list: Lucene is an index and not a

Re: Extracting contact data

2010-01-13 Thread Karl Wettin
Lucene will probably only be helpful if you know what you are looking for, e.g. that you search for a given person, a given street and given time intervals. Is this what you want to do? If you instead are looking for a way to really extract any person, street and time interval that a

Re: Text extraction from ms word doc

2010-01-11 Thread Karl Wettin
Have you tried antiword? http://www.winfield.demon.nl/ karl 11 jan 2010 kl. 21.04 skrev maxSchlein: I was looking for an option for Text extraction from a word doc. Currently I am using POI; however, when there is a table in the doc, for each column POI brings back a . The

Re: about optimize() quetion ,Looking forward to hearing from you soon! Thank you in advance!

2010-01-03 Thread Karl Wettin
3 jan 2010 kl. 13.33 skrev luocanrao: 1、if the readers do not call re-open, segment file the readers will see is after merged or before merged when optimize() done 2、when old segment file on disk is removed,if old segment files are removed after optimize() done at once, How can the

Re: NumericRangeQuery performance with 1/2 billion documents in the index

2010-01-03 Thread Karl Wettin
3 jan 2010 kl. 16.32 skrev Yonik Seeley: Perhaps this is just a huge index, and not enough of it can be cached in RAM. Adding additional clauses to a boolean query incrementally destroys locality. 104GB of index and 4GB of RAM means you're going to be hitting the disk constantly. You

Re: Copy and augment an indexed Document

2010-01-03 Thread Karl Wettin
31 dec 2009 kl. 02.19 skrev Erick Erickson: It is possible to reconstruct a document from the terms, but it's a lossy process. Luke does this (you can see from the UI, and the code is available). There's no utility that I know of to make this easy.

Re: MatchAllDocsQuery and InstantiatedIndex on Lucene 2.9.1

2009-12-10 Thread Karl Wettin
https://issues.apache.org/jira/browse/LUCENE-2144 9 dec 2009 kl. 23.22 skrev Uwe Schindler: This is a bug in InstantiatedIndex. The termDoc(null) was added to get all documents. This was never implemented in Instantiated Index. Can you open an issue? There maybe other queries fail because

Re: search problem

2009-10-29 Thread Karl Wettin
29 okt 2009 kl. 12.12 skrev m.harig: i've a doubt in search , i've a word in my index welcomelucene (without spaces) , when i search for welcome lucene(with a space) , am not able to get the hits. It should pick the document welcomelucene.. is there anyway to do it ? i've used

Re: XorReader?

2009-10-22 Thread Karl Wettin
22 okt 2009 kl. 20.00 skrev Chris Hostetter: : I'm thinking a decorator with deletions on top of the original reader, merged : with the clone reader using a MultiReader. But this would still require a new you don't really mean a clone do you? ... you should just need a very small index

XorReader?

2009-10-21 Thread Karl Wettin
Hi people, I have an application in which the users are allowed to make changes to the database, changes visible only to that user. I.e. they don't modify the original data, they create a clone of the original. When the user request the instance I retrieve the modified clone rather than

Re: Need to know pros and cons of using RAMDirectory

2009-10-17 Thread Karl Wettin
Hi, you should probably ask yourself why your performance is bad before looking at solving it by scaling hardware. I.e. what are your application needs, how do you solve your needs at index/query time and how can you replace this with something better? If you tell us a bit more about

Re: Using TermVectorMapper to compute term frequency across documents

2009-10-15 Thread Karl Wettin
14 okt 2009 kl. 15.15 skrev Grant Ingersoll: On Oct 12, 2009, at 10:46 PM, Thomas D'Silva wrote: I am trying to compute the counts of terms of the documents returned by running a query using a TermVectorMapper. I was wondering if anyone knew if there was a faster way to do this rather

Re: Reverse stemmer?

2009-10-08 Thread Karl Wettin
For the case where the text contains mixed languages there are solutions that simultaneously use morphological rules of two or more languages. Coveo search does this but I don't know what their solution looks like. I suppose one way to do it would be to stem all tokens with all algorithms

Re: Phase Extraction, mainly for English

2009-10-06 Thread Karl Wettin
Hi Andrew, I think you are looking for the shingle package in contrib/analyzers. karl 6 okt 2009 kl. 13.42 skrev Andrew Zhang: Hi guys, The requirement is very simple here, e.g. for this sentence, 'The NBA formally announced its new *social media* guidelines Wednesday', I want to

Re:InstantiatedIndex questions

2009-10-06 Thread Karl Wettin
6 okt 2009 kl. 18.54 skrev David Causse: David, your timing couldn't be better. Just the other day I proposed that we deprecate InstantiatedIndexWriter. The sum of the reasons to this is that I'm a bit lazy. Your mail makes me reconsider. https://issues.apache.org/jira/browse/LUCENE-1948

Re: Phase Extraction, mainly for English

2009-10-06 Thread Karl Wettin
enough. Regards, Andrew On Tue, Oct 6, 2009 at 11:51 PM, Karl Wettin karl.wet...@gmail.com wrote: Hi Andrew, I think you are looking for the shingle package in contrib/analyzers. karl 6 okt 2009 kl. 13.42 skrev Andrew Zhang: Hi guys, The requirement is very simple here, e.g

Re: Help understanding fieldNorm

2009-10-05 Thread Karl Wettin
Hi Ole-Martin, how many characters were in the URL before and after the update? karl 5 okt 2009 kl. 10.21 skrev Ole-Martin Mørk: Hi. I am trying to understand Lucene's scoring algorithm. We're getting some strange results. First we search for a given page by its url. We get this

Re: Help understanding fieldNorm

2009-10-05 Thread Karl Wettin
sorry, I meant title. 5 okt 2009 kl. 11.57 skrev Simon Willnauer: Ole-Martin, did you mention that you did not change the URL value but the title? simon On Mon, Oct 5, 2009 at 11:52 AM, Karl Wettin karl.wet...@gmail.com wrote: Hi Ole-Martin, how many characters were in the URL

Re: Help understanding fieldNorm

2009-10-05 Thread Karl Wettin
of the title was increased by 1, from 41 to 42 characters. -- Ole-Martin Mørk On Mon, Oct 5, 2009 at 12:39 PM, Karl Wettin karl.wet...@gmail.com wrote: sorry, I meant title. 5 okt 2009 kl. 11.57 skrev Simon Willnauer: Ole-Martin, did you mention that you did not change the URL value

Re: Help needed bubbling up relevant records with most recent date

2009-10-02 Thread Karl Wettin
Use a span near query to add boost for the phrases. If you only want to add boost for exact phrases (0 slop) you might want to consider using shingles. In order to add greater score for a date closer in time you can choose between a range of solutions depending on your needs. Using a

Re: Help needed ordering search results

2009-10-01 Thread Karl Wettin
Not quite sure what you ask for, but I think you want to use a span near query (for adding boost to phrases) in a disjunction max query (to define weights of the different fields). karl 1 okt 2009 kl. 02.40 skrev mitu2009: Hi, I've 3 records in Lucene index. Record 1 contains

Re: Whitespace/Standard Analyzer and punctuation

2009-09-30 Thread Karl Wettin
You could look into modifying the standard tokenizer lexer code to handle punctuation (there is a patch in the issue tracker for the old javacc grammar to handle punctuation) and there is also the Gate NLP project which has a fairly nice sentence splitter you might find useful. Add a

Re: Memory consumed by IndexSearcher

2009-09-23 Thread Karl Wettin
23 sep 2009 kl. 17.55 skrev Mindaugas Žakšauskas: Luke says: Has deletions? / Optimized? Yes (1614) / No Very quick response, try optimizing your index and see what happens. I'll get back to you unless someone beats me to it. karl

Re: Memory consumed by IndexSearcher

2009-09-23 Thread Karl Wettin
23 sep 2009 kl. 17.55 skrev Mindaugas Žakšauskas: I was kind of hinting on the resource planning. Every decent enterprise application, apart from other things, has to provide its memory requirements, and my point was - if it uses memory, how much of it needs to be allocated? What are the

Re: Memory consumed by IndexSearcher

2009-09-22 Thread Karl Wettin
Hi Mindaugas, it is - as you sort of point out - the readers associated with your searcher that consume the memory, and not so much the searcher itself. What consumes the most memory is probably field norms (8 bits per field and document unless omitted) and flyweighted terms

Re: Help Needed...

2009-05-28 Thread Karl Wettin
28 maj 2009 kl. 12.22 skrev Gaurav Kumar: Hi everyone, I am doing a project using Lucene where i need to index HTML files. I am using Tika to parse HTML files. But i need to index files according to their tags which means that every text present in different HTML tag (like p a) should

Re: Using Lucene for a classification problem

2009-05-19 Thread Karl Wettin
Hi Jeetu, whether or not it makes sense to use Lucene as your data matrix depends a bit on your requirements. There is a Bayesian classifier available in the issue tracker http://issues.apache.org/jira/browse/LUCENE-1039 that might be helpful, although it does need a little bit of

Re: InstantiatedIndex Memory required

2009-05-13 Thread Karl Wettin
Hi Ravichandra, this is a question better suited to the java-users mailing list. On this list we talk about the development of the Lucene API rather than how to use it. To answer your question, there is no simple formula that says how much RAM an InstantiatedIndex will consume given the

Re: interpreting scores

2009-05-08 Thread Karl Wettin
. :) Thanks! -Nate On Thu, May 7, 2009 at 7:50 AM, Karl Wettin karl.wet...@gmail.com wrote: Nate, will there always be a corresponding mp3 for any given note sheet? As for analysis, I'd try using ngrams of the complete untokenized file name if I were you. Michael Jackson Don't Stop 'till You Get

Re: interpreting scores

2009-05-08 Thread Karl Wettin
SpellChecker classes be of any use? I really feel like I'm floundering here. I am more than willing to put in the work, I just need a push or two in the right directions. :) Thanks! -Nate On Thu, May 7, 2009 at 7:50 AM, Karl Wettin karl.wet...@gmail.com wrote: Nate, will there always

Re: Lucene Index Encryption

2009-05-08 Thread Karl Wettin
I might be missing something here, but why not just store the index on a cryptographic virtual file system? karl 8 maj 2009 kl. 19.09 skrev peter_lena...@ibi.com peter_lena...@ibi.com : Michael, Thanks for the comments they are very insightful. I hadn't thought about the Random

Re: interpreting scores

2009-05-07 Thread Karl Wettin
Nate, will there always be a corresponding mp3 for any given note sheet? As for analysis, I'd try using ngrams of the complete untokenized file name if I were you. Michael Jackson Don't Stop 'till You Get Enough - ^mic, mich, icha, chae, hael, ael , el j, l ja, and so on. See
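The gram list in the message above can be reproduced with a few lines of plain Java. This is a minimal sketch, assuming lowercasing, a leading ^ start-of-name marker and a gram size of 4, which is what the listed grams suggest; the class and method names are hypothetical and no Lucene analyzer classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FileNameNGrams {

    // Character n-grams over the whole, untokenized name, with '^' marking the start.
    static List<String> ngrams(String name, int n) {
        String s = "^" + name.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Michael Jackson", 4));
    }
}
```

Because spaces are kept inside the grams, a gram such as "el j" ties the end of one word to the start of the next, which is what lets near-identical file names match even when tokenization would split them differently.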

Re: Exact match on entire field

2009-05-06 Thread Karl Wettin
You should probably tell us why you need this functionality. Given you only load the stored comparative field for the first it doesn't really have to be that expensive. If you know that the first hit was not a perfect match then you know that any matching documents with a

Re: Suggestive Search

2009-04-08 Thread Karl Wettin
For this you probably want to use ngrams. Whether or not this is something that fits in your current index is hard to say. My guess is that you want to create a new index with one document per unique phrase. You might also want to try to load this index in an InstantiatedIndex, that could

Re: Suggestive Search

2009-04-08 Thread Karl Wettin
If you use prefix grams only then you'll get a forward-only suggestion scheme. I've seen several implementations that use that and it works quite well. harry potter: ^ha, ^har, ^harr, ^harry, ^harry p, ^harry po.. harry houdini: ^ha, ^har, ^harr, ^harry, ^harry h, ^harry ho.. I prefer the
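The prefix grams listed above can be generated as follows. This is a sketch under assumptions: a minimum prefix length of 2, lowercasing, a ^ marker, and skipping prefixes that end in a space so the output matches the listing; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PrefixGrams {

    // Every prefix of the phrase from minLen characters up, marked with '^'.
    static List<String> prefixGrams(String phrase, int minLen) {
        String s = phrase.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int len = minLen; len <= s.length(); len++) {
            if (Character.isWhitespace(s.charAt(len - 1))) {
                continue; // "^harry " would add nothing over "^harry"
            }
            grams.add("^" + s.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(prefixGrams("harry potter", 2));
        System.out.println(prefixGrams("harry houdini", 2));
    }
}
```

The two phrases share their first four grams, so a user who has typed "harr" matches both, and the suggestions only diverge once the typed text reaches the second word. That is the forward-only behaviour the message describes.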

Re: Lucene and Phrase Correction

2009-04-06 Thread Karl Wettin
6 apr 2009 kl. 14.59 skrev Glyn Darkin: Hi Glyn, to be able to spell check phrases E.g Harry Poter is converted to Harry Potter We have a fixed dataset so can build indexes/ dictionaries from our own data. the most obvious solution is to index your contrib/spell checker with shingles. This

Re: Filters, what's going on under the hood?

2009-04-06 Thread Karl Wettin
6 apr 2009 kl. 15.47 skrev Lebiram: I am thinking of adding search filters to my application thinking that they would be more efficient. Can anyone explain what lucene does with search filters? Like, what generally happens when calling search() A filter is a bitset, one bit per document in
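The "one bit per document" description can be made concrete with java.util.BitSet. The sketch below is not Lucene's filter API; the class and method names are hypothetical, and the bit index simply plays the role of the document number.

```java
import java.util.BitSet;

class BitSetFilterSketch {

    // Keep only those query hits whose bit is set in the filter.
    // The inputs are left untouched so the filter can be cached and reused.
    static BitSet applyFilter(BitSet queryHits, BitSet filter) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(1);
        hits.set(4);
        hits.set(7);

        BitSet filter = new BitSet();
        filter.set(4);
        filter.set(7);
        filter.set(9);

        System.out.println(applyFilter(hits, filter)); // prints {4, 7}
    }
}
```

Checking a bit is cheap and involves no scoring, which is why a cached filter can be faster than the equivalent query clause, and also why a filter cannot influence the score of the hits it lets through.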

Re: Free software for language detection

2009-03-29 Thread Karl Wettin
You can also look at https://issues.apache.org/jira/browse/LUCENE-1039 that I've successfully used for language detection of user queries. karl 27 mar 2009 kl. 18.35 skrev Boris Aleksandrovsky: Lisheng, You might want to look at the Nutch LanguageID plugin

Re: People you might know ( a la Facebook) - *slightly offtopic*

2009-03-24 Thread Karl Wettin
There is even an old thread about this on the Mahout-users list: http://markmail.org/message/ludu5hjfczuvgk3n 17 mar 2009 kl. 15.17 skrev Grant Ingersoll: Have a look at the Lucene sister project: Mahout: http://lucene.apache.org/mahout . In there is the Taste collaborative filtering

Re: Upper limit on number of Fields

2009-02-15 Thread Karl Wettin
15 feb 2009 kl. 16.27 skrev Joel Halbert: Is there any practical limit on the number of fields that can be maintained on an index? My index looks something like this, 1 million documents. For each group of 1000 documents I might have 10 indexed fields. This would mean in total about 1

Re: Partial / starts with searching

2009-02-14 Thread Karl Wettin
? Karl Wettin wrote: If you attach an NgramTokenFilter to your analyzer at index and query time you should be able to query for parts of the word. http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html http://lucene.apache.org/java/2_4_0/api/index.html?org

Re: Partial / starts with searching

2009-02-13 Thread Karl Wettin
Hi again Jori, did you try N-grams as suggested in the reply on -dev? karl 13 feb 2009 kl. 09.05 skrev d-fader: Hi, I've actually posted this message in the dev mailing list earlier, because I thought my 'issue' is a limitation of the functionality of Lucene, but they redirected me to

Re: Partial / starts with searching

2009-02-13 Thread Karl Wettin
this :) Jori. Karl Wettin wrote: Hi again Jori, did you try N-grams as suggested in the reply on -dev? karl 13 feb 2009 kl. 09.05 skrev d-fader: Hi, I've actually posted this message in the dev mailing list earlier, because I thought my 'issue' is a limitation of the functionality

Re: TermQuery search returns the same Document several times

2009-02-07 Thread Karl Wettin
5 feb 2009 kl. 14.44 skrev Lebiram: If HitCollector only returns a document once then he might be referring to an application ID that is assigned to a field that has been indexed twice or more with different document IDs. I'll clarify this with him. However is there a way to somehow do a

Re: Field.Store.YES Question

2009-02-05 Thread Karl Wettin
5 feb 2009 kl. 09.30 skrev Amin Mohammed-Coleman: Is there a separate part in the lucene document that the tokenised strings are stored and therefore Lucene knows where to look? Yes. Stored fields are metadata bound to a document, for instance the primary key of the object the Lucene

Re: ShingleMatrixFilter for synonyms

2009-01-14 Thread Karl Wettin
Hi Eric, ShingleMatrixFilter does not add some sort of multiple token synonym feature on top of a plain old Lucene index, it does however create permutations of tokens in a matrix. My suggestion is that you first look at what shingles are and make sure this is something you feel is

Re: updating payloads

2009-01-03 Thread Karl Wettin
I think it would be nice to have a little payload modification tool in SVN. karl 2 jan 2009 kl. 23.02 skrev Grant Ingersoll: I don't think there is any API support for this, but in theory it is possible, as long as you aren't changing the size. It sounds like it could work for you

Re: Re-combining already indexed documents

2009-01-02 Thread Karl Wettin
Hello, the easiest way would be to construct the combined document using the data from your primary source rather than reconstructing it from the index. If the source data no longer is available you could still reconstruct a token stream. The data is however a bit spread out so it can

Re: Extract the text that was indexed

2008-12-30 Thread Karl Wettin
30 dec 2008 kl. 17.13 skrev Lebiram: Hi Lebiram, contrib/misc contains a couple of tools that might be of help. Just wanted to reconstruct a new index based on an existing index(but turning off norms) that's all. If you want to create an identical index but without norms use

Re: Any way to ignore repeated terms in TF calculation?

2008-12-26 Thread Karl Wettin
Hi Israel, you can solve your problem at search time by passing a custom Similarity class that looks something like this: private Similarity similarity = new DefaultSimilarity() { public float tf(float v) { return 1f; } public float tf(int i) { return 1f; }

Payloads

2008-12-26 Thread Karl Wettin
I would very much like to hear how people use payloads. Personally I use them for weight only. And I use them a lot, in almost all applications. I factor the weight of synonyms, stems, dediacritization and what not. I create huge indices that contain lots of tokens at the same position but

Re: Lucene - Authentication

2008-12-14 Thread Karl Wettin
13 dec 2008 kl. 06.05 skrev Aaron Schon: Hi , if I have a Lucene index (or Solr) that is installed in client premises. how would you go about securing the index from being queried in an unauthorized fashion. For example, from malicious users or hackers, or for that matter internal users

Re: Slow queries with lots of hits

2008-12-04 Thread Karl Wettin
Hi Tim, is it possible that the slow queries contain terms that are very common in your index? If so you could replace those clauses with a filter. This would impact the score, as filters do not contribute to it, but if your query contains enough other clauses that should not be a

Re: Cannot find gdata-server

2008-12-04 Thread Karl Wettin
Hello Anees, the Gdata server was phased out by 2.3. You can still get it from the 2.2 tag in the SVN: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_2_0/ karl 5 dec 2008 kl. 07.13 skrev Anees Haider: I have setup lucene, test run it and go through samples. Now I have been

Re: serialVersionUID issue between 2.3 and 2.4

2008-12-01 Thread Karl Wettin
You could get the 2.4 code and set the serialVersionUID of the Term class to the UID assigned to the 2.3 Term class (554776219862331599l) and recompile. As for statically setting a serialVersionUID in the class, one could instead set it to a final value and implement Externalizable in

Re: SpanFirstQuery is not taking wildcard characters (like *) as a logical operator for the preffix

2008-11-28 Thread Karl Wettin
by sequence, but it cant search as a startswith (for library inf*) Karl Wettin wrote: SpanTermQuery is a TermQuery and not a WildcardQuery. You could use a SpanRegexQuery. You could also make your own SpanWildcardQuery based on either WildcardQuery or SpanRegexQuery. You should probably tell

Re: Query time document group boosting

2008-11-27 Thread Karl Wettin
27 nov 2008 kl. 10.15 skrev Toke Eskildsen: On Thu, 2008-11-27 at 07:30 +0100, Karl Wettin wrote: The most scary part is that that you will have to score each and every document that has a source, probably all of the documents in your corpus. I now see my query-logic was flawed. In order

Re: SpanFirstQuery is not taking wildcard characters (like *) as a logical operator for the preffix

2008-11-27 Thread Karl Wettin
SpanTermQuery is a TermQuery and not a WildcardQuery. You could use a SpanRegexQuery. You could also make your own SpanWildcardQuery based on either WildcardQuery or SpanRegexQuery. You should probably tell us a bit about the problem you try to solve rather than asking about the solution

Re: Query time document group boosting

2008-11-26 Thread Karl Wettin
The most scary part is that you will have to score each and every document that has a source, probably all of the documents in your corpus. So if you have a very large number of documents it might be a bit expensive. Also, appending this query for boost only means that you will get

Re: Scoring issue

2008-11-26 Thread Karl Wettin
Alex, if you have length normalization turned on then the length (the number of tokens and perhaps even the distance between the tokens) of the second document is much greater than the length of the first document. The length is the complete number of tokens in the field, i.e. if you add

Re: InstatiatedIndex questions

2008-11-19 Thread karl wettin
Hi David, thanks for the report! I suppose you speak of IndexWriter vs InstantiatedIndexWriter? These are definitely considered discrepancy problems. I've created a new issue in the tracker: http://issues.apache.org/jira/browse/LUCENE-1462 For what reason do you try to serialize the

Re: InstantiatedIndex help + first impression

2008-11-18 Thread karl wettin
The actual performance depends on how much you load to the index. Can you tell us how many documents and how large these documents are that you have in your index? Compared with RAMDirectory I've seen performance boosts of up to 100x in a small index that contains (1-20) Wikipedia sized

Re: InstantiatedIndex help + first impression

2008-11-18 Thread karl wettin
On Wed, Nov 19, 2008 at 3:27 AM, karl wettin [EMAIL PROTECTED] wrote: rewritten query. I.e. this is probably as much a store related expense as it is a Levenshtein calculation expense. this is probably *not* as much a store related.. that is. karl

Re: instantiated index in 2.4

2008-10-29 Thread Karl Wettin
Hi Darren, How large is your corpus? The speed you can expect depends on how much data you load it with. There is a graph in the package level javadocs that shows this: http://lucene.apache.org/java/2_4_0/api/contrib-instantiated/org/apache/lucene/store/instantiated/package-summary.html

Re: Calculation of fieldNorm causes irritating effect of sort order

2008-10-02 Thread Karl Wettin
2 okt 2008 kl. 14.47 skrev Jimi Hullegård: But apparently this setOmitNorms(true) also disables boosting as well. That is ok for now, but what if we want to use boosting in the future? Is there no way to disable the length normalization while still keeping the boost calculation? You can

Re: Index time Document Boosting and Query Time Sorts

2008-09-24 Thread Karl Wettin
24 sep 2008 kl. 12.40 skrev Grant Ingersoll: One side note based on your example, below: Index time boosting does not have much granularity (only 255 values), in other words, there is a loss of precision. Thus, you want to make sure your boosts are different enough such that you can
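The loss of precision mentioned above comes from squeezing a float into a single byte. The sketch below shows one such encoding, keeping only the top bits of the float. The exact bit layout (shift by 21, zero exponent offset 15) is an assumption made for illustration, and the class and method names are hypothetical; it should not be read as the definitive Lucene norm encoding.

```java
class OneByteBoostSketch {

    // Assumed layout: keep the sign-free top bits of the float, offset so
    // that a useful range of small positive values fits into 1..255.
    private static final int ZERO_POINT = (63 - 15) << 3;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        if (small <= ZERO_POINT) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative, or underflow
        }
        if (small >= ZERO_POINT + 0x100) {
            return -1; // overflow: largest representable value
        }
        return (byte) (small - ZERO_POINT);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] boosts = {1.0f, 1.1f, 1.2f, 1.25f, 1.5f, 2.0f};
        for (float boost : boosts) {
            System.out.println(boost + " is stored as " + decode(encode(boost)));
        }
    }
}
```

With this layout, boosts of 1.0, 1.1 and 1.2 all decode to the same value, so index-time boosts need to differ by a clear margin before they have any effect on ranking.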

Re: lucene Front-end match

2008-09-19 Thread Karl Wettin
19 sep 2008 kl. 11.05 skrev 叶双明: Document<stored/uncompressed,indexed<field:abc>> Document<stored/uncompressed,indexed<field:bcd>> How can I get the first Document by some query string like a , ab or abc but no b and bc? You would create an ngram filter that creates grams from the first

Re: Some SSD results to share

2008-09-16 Thread Karl Wettin
Related, I've been considering filesystem based filters on SSD. That ought to be rather fast, consume no memory and be as simple as a RandomAccessFile. I didn't spend too much time on it, gave up when I couldn't figure out when it made sense to close the file. Perhaps it would be nice with

Re: Sorting in lucene through Document boosting

2008-09-15 Thread Karl Wettin
15 sep 2008 kl. 14.08 skrev Dragan Jotanovic: I made simple Similarity implementation: public float tf(float arg0) { return 1f; } Why do you touch the term frequency? Is that perhaps unrelated to what's discussed in this thread? karl

Re: instantiated index in 2.4

2008-09-15 Thread Karl Wettin
15 sep 2008 kl. 18.45 skrev Cam Bazz: I have been looking at instantiated index in the trunk. Does this come with a searcher? Pass an InstantiatedIndexReader to the constructor of an IndexSearcher. Are the adds reflected directly to the index? Yes. An InstantiatedIndexReader is always

Re: instantiated index in 2.4

2008-09-15 Thread Karl Wettin
15 sep 2008 kl. 18.51 skrev Karl Wettin: Are the adds reflected directly to the index? Yes. An InstantiatedIndexReader is always current. You will probably still have to reconstruct your searcher. I never really looked into what happens if you don't. The second statement was wrong

Re: Frequently updated fields

2008-09-12 Thread Karl Wettin
Hi Wojciech, can you please give us a bit more specific information about the meta data fields that will change? I would recommend that you look at creating filters from your primary persistency for query clauses such as unread/read, mailbox folders, etc. karl 12 sep 2008 kl. 13.57

Re: removing norms

2008-09-12 Thread Karl Wettin
12 sep 2008 kl. 12.25 skrev Bogdan Ghidireac: I have a large index and I want to remove the norms from a field. Is there a way to do this without reindexing everything ? You could invoke IndexReader#setNorm(int, String, float) and set the value to 1f. karl

Re: Frequently updated fields

2008-09-12 Thread Karl Wettin
12 sep 2008 kl. 14.51 skrev Wojciech Strzałka: The most changing fields will be I think: Status (read/unread): in fact I'm afraid of this at most - any mail incoming to the system will need to be indexed at least twice This is why I recommended you to use a

Re: Frequently updated fields

2008-09-12 Thread Karl Wettin
with frequently changing fields. Karl Wettin wrote: Hi Wojciech, can you please give us a bit more specific information about the meta data fields that will change? I would recommend that you look at creating filters from your primary persistency for query clauses such as unread/read, mailbox

Re: string similarity measures

2008-09-04 Thread Karl Wettin
4 sep 2008 kl. 14.38 skrev Cam Bazz: Hello, This came up before but - if we were to make a swear word filter, string edit distances are no good. for example words like `shot` is confused with `shit`. there is also problem with words like hitchcock. apparently i need something like
