Unicode normalisation *before* tokenisation?

2011-01-16 Thread Trejkaz
Hi all. I discovered there is a normalise filter now, using ICU's Normalizer2 (org.apache.lucene.analysis.icu.ICUNormalizer2Filter). However, as this is a filter, various problems can result if used with StandardTokenizer. One in particular is half-width Katakana. Supposing you start out with
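The offset problem is easy to see with the JDK alone: NFKC changes the length of half-width Katakana text, so a filter normalising after tokenisation ends up reporting offsets into a string whose length no longer matches the original. A minimal sketch (class name is mine, not from the thread):

```java
import java.text.Normalizer;

public class HalfWidthKatakanaDemo {
    public static void main(String[] args) {
        // Half-width "ﾊﾞｲﾄ" (baito): 4 chars, because the voiced sound
        // mark (U+FF9E) is a separate character in the half-width range
        String halfWidth = "\uFF8A\uFF9E\uFF72\uFF84";

        // NFKC folds half-width forms to full-width and composes the
        // voiced sound mark onto the preceding letter: "バイト", only 3 chars
        String nfkc = Normalizer.normalize(halfWidth, Normalizer.Form.NFKC);

        System.out.println(nfkc);                                        // バイト
        System.out.println(halfWidth.length() + " -> " + nfkc.length()); // 4 -> 3
    }
}
```

Because the normalised text is shorter, any character offsets computed after a post-tokenisation normalise step cannot line up with the un-normalised source, which is why normalising before tokenisation was being asked about.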

Re: Unicode normalisation *before* tokenisation?

2011-01-16 Thread Trejkaz
On Mon, Jan 17, 2011 at 11:53 AM, Robert Muir rcm...@gmail.com wrote: On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz trej...@trypticon.org wrote: So I guess I have two questions:    1. Is there some way to do filtering to the text before tokenisation without upsetting the offsets reported

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Trejkaz
On Thu, Jan 20, 2011 at 9:08 AM, Paul Libbrecht p...@hoplahup.net wrote: Wouldn't it be better to prefer precise matches (a field that is analyzed with StandardAnalyzer for example) but also allow matches that are stemmed. StandardAnalyzer isn't quite precise, is it? StandardFilter does some

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-12 Thread Trejkaz
On Fri, Mar 11, 2011 at 10:03 PM, shrinath.m shrinat...@webyog.com wrote: I am trying to index content within certain HTML tags, how do I index it? Which is the best parser/tokenizer available to do this? This doesn't really answer the question, but I think it will help... The features you

a faster way to addDocument and get the ID just added?

2011-03-28 Thread Trejkaz
Hi all. I'm trying to parallelise writing documents into an index. Let's set aside the fact that 3.1 is much better at this than 3.0.x... but I'm using 3.0.3. One of the things I need to know is the doc ID of each document added so that we can add them into auxiliary database tables which are

Re: a faster way to addDocument and get the ID just added?

2011-03-29 Thread Trejkaz
On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson erickerick...@gmail.com wrote: I'm always skeptical of storing the doc IDs since they can change out from underneath you (just delete even a single document and optimize). We never delete documents. Even when a feature request came in to update

Re: Using IndexWriterConfig repeatedly in 3.1

2011-04-01 Thread Trejkaz
On Sat, Apr 2, 2011 at 7:07 AM, Christopher Condit con...@sdsc.edu wrote: I see in the JavaDoc for IndexWriterConfig that: Note that IndexWriter makes a private clone; if you need to subsequently change settings use IndexWriter.getConfig(). However when I attempt to use the same

Re: switching between Query parsers

2011-04-18 Thread Trejkaz
On Thu, Apr 14, 2011 at 9:44 PM, shrinath.m shrinat...@webyog.com wrote: Consider this case : Lucene index contains documents with these fields : title author publisher I have coded my app to use MultiFieldQueryParser so that it queries all fields. Now if user types something like

Re: Immutable OpenBitSet?

2011-04-28 Thread Trejkaz
On Thu, Apr 28, 2011 at 6:13 PM, Uwe Schindler u...@thetaphi.de wrote: In general a *newly* created object that was not yet seen by any other thread is always safe. This is why I said, set all bits in the ctor. This is easy to understand: Before the ctor returns, the object's contents and all

Re: MultiFieldQueryParser with default AND and stopfilter

2011-06-08 Thread Trejkaz
On Wed, Jun 8, 2011 at 6:52 PM, Elmer evanchaste...@gmail.com wrote: the parsed query becomes: '+(title:the) +(title:project desc:project)'. So, the problem is that docs that have the term 'the' only appearing in their desc field are excluded from the results. Subclass MFQP and override

Re: Corrupt segments file full of zeros

2011-06-28 Thread Trejkaz
On Wed, Jun 29, 2011 at 2:24 AM, Michael McCandless luc...@mikemccandless.com wrote: Here's the issue:    https://issues.apache.org/jira/browse/LUCENE-3255 It's because we read the first 0 int to be an ancient segments file format, and the next 0 int to mean there are no segments.  Yuck!

StandardAnalyzer compatibility between 2.3 and 3.0

2011-07-10 Thread Trejkaz
Hi all. I created a test using Lucene 2.3. When run, this generates a single token: public static void main(String[] args) throws Exception { String string = \u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432; StandardAnalyzer analyser = new StandardAnalyzer();
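The string in that test is Васи́льев with a combining acute accent (U+0301) after the и. The accent is worth looking at on its own, because unlike the half-width Katakana case there is no precomposed "и with acute" for normalisation to fold it into, so whether the token stays whole depends entirely on how the tokenizer treats combining marks. A JDK-only illustration (class name is mine):

```java
import java.text.Normalizer;

public class CombiningAcuteDemo {
    public static void main(String[] args) {
        // "Васи́льев" with a combining acute accent (U+0301) after the 'и'
        String name = "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432";

        // U+0301 is a non-spacing mark, i.e. logically part of the preceding letter
        System.out.println(Character.getType('\u0301') == Character.NON_SPACING_MARK); // true

        // There is no precomposed Cyrillic "и with acute", so even NFC
        // cannot fold the accent away: the string keeps all 9 chars
        String nfc = Normalizer.normalize(name, Normalizer.Form.NFC);
        System.out.println(nfc.length()); // 9
    }
}
```

A tokenizer that treats U+0301 as part of the word yields one token here; one that breaks on it yields two, which would explain a behaviour difference between StandardAnalyzer versions.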

Re: Searching for Empty Field

2011-07-15 Thread Trejkaz
On Fri, Jul 15, 2011 at 10:02 AM, Trieu, Jason T trieu.ja...@con-way.com wrote: Hi all, I read postings about searching for empty field with but did not find any cases of successful search using query language syntax itself(-myField:[* TO *] for example). We have been using: -myField:*

Re: Searching for Empty Field

2011-07-15 Thread Trejkaz
On Fri, Jul 15, 2011 at 4:45 PM, Uwe Schindler u...@thetaphi.de wrote: Hi, The crappy thing is that to actually detect if there are any tokens in the field you need to make a TokenStream which can be used to read the first token and then rewind again.  I'm not sure if there is such a thing

Rewriting other query types into span queries and two questions about this

2011-08-04 Thread Trejkaz
Hi all. I am writing a custom query parser which strongly resembles StandardQueryParser (I use a lot of the same processors and builders, with a slightly customised config handler and a completely new syntax parser written as an ANTLR grammar.) My parser has additional syntax for span queries.

Re: Grouping Clauses to Preserve Order of Boolean Precedence

2011-08-04 Thread Trejkaz
On Fri, Aug 5, 2011 at 1:57 AM, Jim Swainston jimswains...@googlemail.com wrote: So if the Text input is: Marketing AND Smith OR Davies I want my program to work out that this should be grouped as the following (as AND has higher precedence than OR): (Marketing AND Smith) OR Davies. I'm
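One way to get the (Marketing AND Smith) OR Davies grouping is a two-level recursive-descent pass over the tokens, parsing AND at a tighter level than OR. A toy sketch, not what the thread's author actually built: bare terms only, no fields, NOT, or user parentheses.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class PrecedenceDemo {
    private final Deque<String> tokens;

    PrecedenceDemo(String input) {
        tokens = new ArrayDeque<>(Arrays.asList(input.trim().split("\\s+")));
    }

    // orExpr := andExpr (OR andExpr)*   -- OR is the loosest binding
    String orExpr() {
        String left = andExpr();
        while ("OR".equals(tokens.peek())) {
            tokens.poll();
            left = "(" + left + " OR " + andExpr() + ")";
        }
        return left;
    }

    // andExpr := term (AND term)*      -- AND binds tighter, so it groups first
    String andExpr() {
        String left = tokens.poll();
        while ("AND".equals(tokens.peek())) {
            tokens.poll();
            left = "(" + left + " AND " + tokens.poll() + ")";
        }
        return left;
    }

    public static void main(String[] args) {
        // prints ((Marketing AND Smith) OR Davies)
        System.out.println(new PrecedenceDemo("Marketing AND Smith OR Davies").orExpr());
    }
}
```

The same shape extends naturally: add a level below andExpr for NOT and parenthesised sub-expressions, and emit BooleanQuery clauses instead of strings.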

Re: Rewriting other query types into span queries and two questions about this

2011-08-07 Thread Trejkaz
On Mon, Aug 8, 2011 at 8:58 AM, Michael Sokolov soko...@ifactory.com wrote: Can you do something approximately equivalent like: within(5, 'my', and('cat', 'dog')) - within(5, 'my', within(5, 'cat', 'dog') ) Might not be exactly the same in terms of distances (eg cat x x x my x x x dog)

Re: Rewriting other query types into span queries and two questions about this

2011-08-07 Thread Trejkaz
On Mon, Aug 8, 2011 at 10:00 AM, Trejkaz trej...@trypticon.org wrote:    within(5, 'my', and('cat', 'dog')) - within(5, 'my', within(10, 'cat', 'dog') ) To extend my example and maybe make it a bit more hellish, take this one: within(2, A, and(B, or(C, and(D, E After rewriting both

Strange change to query parser behaviour in recent versions

2011-08-17 Thread Trejkaz
Hi all. Suppose I am searching for - 限定 In 3.0, QueryParser would parse this as a phrase query. In 3.3, it parses it as a boolean query, but offers an option to treat it like a phrase. Why would the default be not to do this? Surely you would always want it to become a phrase query. The new

Re: Strange change to query parser behaviour in recent versions

2011-08-20 Thread Trejkaz
On Fri, Aug 19, 2011 at 11:05 AM, Chris Hostetter hossman_luc...@fucit.org wrote: See LUCENE-2458 for the backstory. The argument was that while phrase queries were historically generated by the query parser when a single (whitespace delimited) chunk of query parser input produced multiple

Re: Strange change to query parser behaviour in recent versions

2011-08-21 Thread Trejkaz
On Sat, Aug 20, 2011 at 7:00 PM, Robert Muir rcm...@gmail.com wrote: On Sat, Aug 20, 2011 at 3:34 AM, Trejkaz trej...@trypticon.org wrote: As an aside, Google's behaviour seems to follow the old way.  For instance, [[ 限定 ]] returns 640,000,000 hits and [[ 限 定 ]] returns 772,000,000

IndexWriter.ramSizeInBytes() no longer returns to 0 after commit()?

2011-08-22 Thread Trejkaz
Hi all. We are using IndexWriter with no limits set and managing the commits ourselves, mainly so that we can ensure they are done at the same time as other (non-Lucene) commits. After upgrading from 3.0 to 3.3, we are seeing a change in ramSizeInBytes() behaviour where it is no longer resetting

Re: IndexWriter.ramSizeInBytes() no longer returns to 0 after commit()?

2011-08-23 Thread Trejkaz
On Wed, Aug 24, 2011 at 4:45 AM, Michael McCandless luc...@mikemccandless.com wrote: Hmm... this looks like a side-effect of LUCENE-2680, which was merged back from trunk to 3.1. So the problem is, IW recycles the RAM it has allocated, and so this method is returning the allocated RAM, even

Re: Possible to use span queries to avoid stepping over null index positions

2011-08-27 Thread Trejkaz
On Sat, Aug 27, 2011 at 2:30 AM, ikoelli...@daegis.com wrote: Hello, In our indexes we have a field that is a combination of other various metadata fields (i.e. subject, from, to, etc.). Each field that is added has a null position at the beginning. As an example, in Luke the field data

Re: Extracting all documents for a given search

2011-09-19 Thread Trejkaz
On Mon, Sep 19, 2011 at 3:50 AM, Charlie Hubbard charlie.hubb...@gmail.com wrote: Here was the prior API I was calling:        Hits hits = getSearcher().search( query, filter, sort ); The new API:        TopDocs hits = getSearcher().search( query, filter, startDoc + length, sort ); So

SpanNearQuery and matching spans inside the first span

2011-12-05 Thread Trejkaz
Supposing I have a document with just "hi there" as the text. If I do a span query like this: near(near(term('hi'), term('there'), slop=0, forwards), term('hi'), slop=1, any-direction), that returns no hits. However, if I do a span query like this: near(near(term('hi'), term('there'),

Re: Query that returns all docs that contain a field

2011-12-19 Thread Trejkaz
On Mon, Dec 19, 2011 at 9:05 PM, Paul Taylor paul_t...@fastmail.fm wrote: I was looking for a Query that returns all documents that contain a particular field; it doesn't matter what the value of the field is, just that the document contains the field. If you don't care about performance (or if

Remoting Lucene

2012-01-09 Thread Trejkaz
Hi all. I want to access a Lucene index remotely. I'm aware of a couple of options for it which seem to operate more or less at the IndexSearcher level - send a query, get back results. But in our case, we use IndexReader directly for building statistics, which is too slow to do via individual

Re: comparing index fields within a query

2012-01-23 Thread Trejkaz
On Mon, Jan 23, 2012 at 11:31 PM, Jamie ja...@stimulussoft.com wrote: Ian Thanks. I'll have to read up about it. I have a lot of comparisons to make, so cannot precompute the values. How many is a lot? If it were 100 or so I would still be tempted to do all 4,950 comparisons and find some
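The 4,950 is just the number of unordered pairs among 100 items, C(n, 2) = n(n - 1) / 2, and the quadratic growth is why "how many is a lot" matters here. A trivial sketch (class and method names are mine):

```java
public class PairCount {
    // number of unordered pairs among n items: C(n, 2) = n(n-1)/2
    static long pairs(int n) {
        return (long) n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(pairs(100));     // 4950
        System.out.println(pairs(100_000)); // 4999950000, where precomputing starts to pay off
    }
}
```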

Re: NGraming document for similar documents matching

2012-01-26 Thread Trejkaz
On Fri, Jan 27, 2012 at 10:41 AM, Saurabh Gokhale saurabhgokh...@gmail.com wrote: I wanted to check if Ngramming the document contents (space is not the issue) would do any good for better matching? Currently I see Ngram is mostly used for auto-complete or spell checking, but is this useful for

Lucene appears to use memory maps after unmapping them

2012-01-31 Thread Trejkaz
Hi all. I've found a rather frustrating issue which I can't seem to get to the bottom of. Our application will crash with an access violation around the time when the index is closed, with various indications of what's on the stack, but the common things being SegmentTermEnum.next and

Re: Lucene appears to use memory maps after unmapping them

2012-01-31 Thread Trejkaz
On Wed, Feb 1, 2012 at 11:30 AM, Robert Muir rcm...@gmail.com wrote: the problem is caused by searching indexreaders after you closed them. in general we can try to add more and more safety, but at the end of the day, if you close an indexreader while a search is running, you will have

Re: Lucene appears to use memory maps after unmapping them

2012-01-31 Thread Trejkaz
On Wed, Feb 1, 2012 at 1:14 PM, Robert Muir rcm...@gmail.com wrote: No, I don't think you should use close at all, because your problem is you are calling close() when its unsafe to do so (you still have other threads that try to search the reader after you closed it). Instead of trying to

Merging results from two searches on two separate Searchers

2012-02-14 Thread Trejkaz
Hi all. We have 1..N indexes for each time someone adds some data. Each time they can choose different tokenisation settings. Because of this, each text index has its own query parser instance. Because each query parser could generate a different Query (though I guess whether they do or not is

Re: Merging results from two searches on two separate Searchers

2012-02-14 Thread Trejkaz
On Wed, Feb 15, 2012 at 11:46 AM, Uwe Schindler u...@thetaphi.de wrote: Scores are only compatible if the query is the same, which is not the case for you. So you cannot merge hits from different queries. So I guess in the case where the different query parsers happen to generate the same

Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Trejkaz
On Mon, Feb 20, 2012 at 12:07 PM, Uwe Schindler u...@thetaphi.de wrote: See my response. The problem is not in Lucene; its in general a problem of fixed thread pools that execute other callables from within a callable running at the moment in the same thread pool. Callables are simply

Re: How to add DocValues Field to a document in an optimal manner.

2012-02-29 Thread Trejkaz
On Thu, Mar 1, 2012 at 6:20 PM, Sudarshan Gaikaiwari sudars...@acm.org wrote: Hi https://builds.apache.org/job/Lucene-trunk/javadoc/core/org/apache/lucene/document/DocValuesField.html The documentation at the above link indicates that the optimal way to add a DocValues field is to create it

Re: Range queries in successive positions

2012-03-01 Thread Trejkaz
On Fri, Mar 2, 2012 at 6:22 PM, su ha s_han...@yahoo.com wrote: Hi, I'm new to Lucene. I've indexed some documents with Lucene and need to sanitize them to ensure that they do not have any social security numbers (3-digits 2-digits 4-digits). (How) Can I write a query (with the QueryParser)
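For reference, the shape described (3 digits, 2 digits, 4 digits) as a plain java.util.regex pattern. Note the Lucene wrinkle: RegexpQuery matches individual indexed terms, so whether this works through a query depends on the tokenizer keeping the hyphenated number together as one term; a JDK-only sketch of the pattern itself, with a sample string of my own:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnScan {
    // 3 digits, 2 digits, 4 digits separated by hyphens,
    // anchored on word boundaries so longer digit runs don't match
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    public static void main(String[] args) {
        Matcher m = SSN.matcher("call 078-05-1120 before Friday");
        while (m.find()) {
            System.out.println(m.group()); // 078-05-1120
        }
    }
}
```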

Re: Sort.INDEXORDER works incorrectly?

2012-04-17 Thread Trejkaz
On Wed, Apr 18, 2012 at 9:27 AM, Vladimir Gubarkov xon...@gmail.com wrote: Hi, dear Lucene specialists, The only explanation I could think of is the new TieredMergePolicy instead of old LogMergePolicy. Could it be that because of TieredMergePolicy merges not adjacent segments - this results

Re: Lucene's internal doc ID space

2012-05-12 Thread Trejkaz
On Fri, May 11, 2012 at 9:56 PM, Jong Kim jong.luc...@gmail.com wrote: 2. If Lucene can recycle old IDs, it would be even better if I could force it to re-use a particular doc ID when updating a document by deleting old one and creating new one. This scheme will allow me to reference this doc

Re: Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Trejkaz
On Thu, May 17, 2012 at 7:11 AM, Chris Harris rygu...@gmail.com wrote: but also crazier ones, perhaps like agreement w/5 (medical and companion) (dog or dragon) w/5 (cat and cow) (daisy and (dog or dragon)) w/25 (cat not cow) [skip] Everything in your post matches our experience. We ended up

Re: Store a query in a database for later use

2012-05-17 Thread Trejkaz
On Fri, May 18, 2012 at 6:23 AM, Jamie Johnson jej2...@gmail.com wrote: I think you want to have a look at the QueryParser classes.  Not sure which you're using to start with but probably the default QueryParser should suffice. There are (at least) two catches though: 1. The semantics of a

Re: Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-25 Thread Trejkaz
On Sat, May 26, 2012 at 12:07 PM, Chris Harris rygu...@gmail.com wrote: Alternatively, if you insist that query merger w/5 (medical and agreement) should match document medical x x x merger x x x agreement then you can propagate 2x the parent's slop value down to child queries. This is in

Re: easy one? IN and OR stopword help

2012-06-07 Thread Trejkaz
On Fri, Jun 8, 2012 at 5:35 AM, Jack Krupansky j...@basetechnology.com wrote: Well, if you have defined OR/or and IN/in as stopwords, what is it you expect other than for the analyzer to ignore those terms (which with a boolean “AND” means match nothing)? Is this behaviour really logical?

Re: QueryParser and BooleanQuery

2012-07-23 Thread Trejkaz
On Mon, Jul 23, 2012 at 10:16 PM, Deepak Shakya just...@gmail.com wrote: Hey Jack, Can you let me know how I should do that? I am using the Lucene 3.6 version and I don't see any parse() method for StandardAnalyzer. In your case, presumably at indexing time you should be using a

Re: Usage of NoMergePolicy and its potential implications

2012-07-25 Thread Trejkaz
On Thu, Jul 26, 2012 at 5:38 AM, Simon Willnauer simon.willna...@gmail.com wrote: you really shouldn't do that! If you use lucene as a Primary key generator why don't you build your own on top. Just add one layer that accepts the document and returns the PID and internally put it in an ID

Re: Why does this query slow down Lucene?

2012-08-15 Thread Trejkaz
On Thu, Aug 16, 2012 at 11:27 AM, zhoucheng2008 zhoucheng2...@gmail.com wrote: +(title:21 title:a title:day title:once title:a title:month) Looks like you have a fairly big boolean query going on here, and some of the terms you're using are really common ones like a. Are you using AND or OR

Re: Finding the most matching (cf. similar) document to another one

2012-09-07 Thread Trejkaz
On Fri, Sep 7, 2012 at 6:12 PM, Jochen Hebbrecht jochenhebbre...@gmail.com wrote: Hi qibaoyuan, I tried your second solution, using the scoring data. I think in this way, I could use MoreLikeThis. All documents with a score X are a possible match :-). FWIW, there is also

Re: SpanNearQuery distance issue

2012-09-19 Thread Trejkaz
On Thu, Sep 20, 2012 at 4:28 AM, vempap phani.vemp...@emc.com wrote: Hello All, I've an issue with respect to the distance measure of SpanNearQuery in Lucene. Let's say I've the following two documents: DocID: 6, content:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1001 1002 1003 1004 1005

Re: international stop set?

2012-10-26 Thread Trejkaz
On Sat, Oct 27, 2012 at 1:53 PM, Tom fivemile...@gmail.com wrote: Hello, using Lucene 4.0.0b, I am trying to get a superset of all stop words (for an international app). I have looked around, and not found anything specific. Is this the way to go? CharArraySet internationalSet = new

Re: Excessive use of IOException without proper documentation

2012-11-04 Thread Trejkaz
On Mon, Nov 5, 2012 at 4:25 AM, Michael-O 1983-01...@gmx.net wrote: Continuing my answer from above. Have you ever worked with the Spring Framework? They apply a very nice exception translation pattern. All internal exceptions are turned to specialized unchecked exceptions like

Re: Can I still use SearcherManager in this situation?

2012-11-07 Thread Trejkaz
On Wed, Nov 7, 2012 at 10:11 PM, Ian Lea ian@gmail.com wrote: 4.0 has maybeRefreshBlocking which is useful if you want to guarantee that the next call to acquire() will return a refreshed instance. You don't say what version you're using. If you're stuck on 3.6.1 can you do something with

Re: Can I still use SearcherManager in this situation?

2012-11-09 Thread Trejkaz
On Thu, Nov 8, 2012 at 8:29 AM, Trejkaz trej...@trypticon.org wrote: It's not only protected... but the class is final as well (the method might as well be private so that it doesn't give a false sense of hope that it can be overridden.) I might have to clone the whole class just to make

Performance of IndexSearcher.explain(Query)

2012-11-20 Thread Trejkaz
I have a feature I wanted to implement which required a quick way to check whether an individual document matched a query or not. IndexSearcher.explain seemed to be a good fit for this. The query I tested was just a BooleanQuery with two TermQuery inside it, both with MUST. I ran an empty query

Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Trejkaz
On Wed, Nov 21, 2012 at 12:33 AM, Ramprakash Ramamoorthy youngestachie...@gmail.com wrote: On Tue, Nov 20, 2012 at 5:42 PM, Danil ŢORIN torin...@gmail.com wrote: Ironically most of the changes are in unicode handling and standard analyzer ;) Ouch! It hurts then ;) What we did going from 2

Re: Performance of IndexSearcher.explain(Query)

2012-11-20 Thread Trejkaz
On Wed, Nov 21, 2012 at 10:40 AM, Robert Muir rcm...@gmail.com wrote: Explain is not performant... but the comment is fair I think? Its more of a worst-case, depends on the query. Explain is going to rewrite the query/create the weight and so on just to advance() the scorer to that single doc

Does anyone have tips on managing cached filters?

2012-11-22 Thread Trejkaz
I recently implemented the ability for multiple users to open the index in the same process (whoa, you might think, but this has been a single user application forever and we're only just making the platform capable of supporting more than that.) I found that filters are being stored twice and

Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Trejkaz
On Tue, Nov 27, 2012 at 9:31 AM, Robert Muir rcm...@gmail.com wrote: On Thu, Nov 22, 2012 at 11:10 PM, Trejkaz trej...@trypticon.org wrote: As for actually doing the invalidation, CachingWrapperFilter itself doesn't appear to have any mechanism for invalidation at all, so I imagine I

Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Trejkaz
On Wed, Nov 28, 2012 at 2:09 AM, Robert Muir rcm...@gmail.com wrote: I don't understand how a filter could become invalid even though the reader has not changed. I did state two ways in my last email, but just to re-iterate: (1): The filter reflects a query constructed from lines in a text

Difference in behaviour between LowerCaseFilter and String.toLowerCase()

2012-11-29 Thread Trejkaz
Hi all. trying to figure out what I was doing wrong in some of my own code so I looked to LowerCaseFilter since I thought I remembered it doing this correctly, and lo and behold, it failed the same test I had written. Is this a bug or an intentional difference in behaviour? @Test public

Re: Difference in behaviour between LowerCaseFilter and String.toLowerCase()

2012-11-30 Thread Trejkaz
On Fri, Nov 30, 2012 at 8:22 PM, Ian Lea ian@gmail.com wrote: Sounds like a side effect of possibly different, locale-dependent, results of using String.toLowerCase() and/or Character.toLowerCase(). http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#toLowerCase() specifically
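The locale trap referred to here is easy to demonstrate with the JDK alone: in the Turkish locale, capital I lowercases to dotless ı, and some characters lowercase to more than one char, which a per-character mapping like Character.toLowerCase cannot reproduce. A small sketch (class name is mine):

```java
import java.util.Locale;

public class LowerCaseLocaleDemo {
    public static void main(String[] args) {
        String s = "TITLE";

        // Root locale: the expected dotted lowercase i
        System.out.println(s.toLowerCase(Locale.ROOT));      // title

        // Turkish: LATIN CAPITAL LETTER I lowercases to dotless ı (U+0131)
        System.out.println(s.toLowerCase(new Locale("tr"))); // tıtle

        // Some mappings are one-to-many: İ (U+0130) lowercases in the
        // root locale to 'i' plus a combining dot above, i.e. 2 chars,
        // which no per-character filter can produce
        System.out.println("\u0130".toLowerCase(Locale.ROOT).length()); // 2
    }
}
```

This is exactly the kind of divergence that makes a string lowercased with String.toLowerCase() fail to match terms lowercased character by character at index time.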

Re: Difference in behaviour between LowerCaseFilter and String.toLowerCase()

2012-12-03 Thread Trejkaz
On Tue, Dec 4, 2012 at 10:09 AM, Vitaly Funstein vfunst...@gmail.com wrote: If you don't need to support case-sensitive search in your application, then you may be able to get away with adding string fields to your documents twice - lowercase version for indexing only, and verbatim to store.

Re: Lucene 4.0, Serialization

2012-12-04 Thread Trejkaz
On Tue, Dec 4, 2012 at 8:33 PM, BIAGINI Nathan nathan.biag...@altanis.fr wrote: I need to send a class containing Lucene elements such as `Query` over the network using EJB and of course this class needs to be serialized. I marked my class as `Serializable` but it does not seem to be enough:

Re: RE: Stemming and Wildcard - or fire and water

2013-01-04 Thread Trejkaz
On Sat, Jan 5, 2013 at 4:06 AM, Klaus Nesbigall llg...@gmx.de wrote: The actual behavior doesn't work either. The English word families will not be found in case the user types the query familie*. So why solve the problem by postulating one opinion as right and another as wrong? A simple
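The mismatch is mechanical: an English stemmer indexes families under the Porter stem famili, while familie* is a literal prefix test against indexed terms that never sees the stemmer. A sketch of the comparison, with the stem hard-coded rather than computed:

```java
public class StemVsWildcard {
    public static void main(String[] args) {
        // Porter-stemmed form of "families" as it would sit in the index
        String indexedTerm = "famili";

        // A wildcard query like familie* is a raw prefix match against
        // indexed terms, so it misses the stemmed term
        System.out.println(indexedTerm.startsWith("familie")); // false

        // Stemming or truncating the query prefix as well would let the
        // two sides meet in the middle
        System.out.println("familie".startsWith(indexedTerm)); // true
    }
}
```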

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread Trejkaz
On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe sar...@gmail.com wrote: Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be be of interest to you, along with the token filters in that same module. - Steve ICUTokenizer sounds like it's implementing UAX #29, which is exactly

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-09 Thread Trejkaz
On Wed, Jan 9, 2013 at 5:25 PM, Steve Rowe sar...@gmail.com wrote: Dude. Go look. It allows for per-script specialization, with (non-UAX#29) specializations by default for Thai, Lao, Myanmar and Hebrew. See DefaultICUTokenizerConfig. It's filled with exactly the opposite of what you

Re: Custom Query Syntax/Parser

2013-01-28 Thread Trejkaz
On Tue, Jan 29, 2013 at 3:42 AM, Andrew Gilmartin and...@andrewgilmartin.com wrote: When I first started using Lucene, Lucene's Query classes where not suitable for use with the Visitor pattern and so I created my own query class equivalants and other more specialized ones. Lucene's classes

Re: How to properly use updatedocument in lucene.

2013-01-31 Thread Trejkaz
On Thu, Jan 31, 2013 at 11:05 PM, Michael McCandless luc...@mikemccandless.com wrote: It's confusing, but you should never try to re-index a document you retrieved from a searcher, because certain index-time details (eg, whether a field was tokenized) are not preserved in the stored document.

Migrating from using doc IDs to using application IDs from the FieldCache

2013-01-31 Thread Trejkaz
Hi all. We have an application which has been around for so long that it's still using doc IDs to key to an external database. Obviously this won't work forever (even in Lucene 3.x we had to use a custom merge policy to keep it working) so we want to introduce application IDs eventually. We have

Re: how to search a certain number of document without travelling all related documents

2013-03-14 Thread Trejkaz
On Tue, Mar 12, 2013 at 10:42 PM, Hu Jing huj@gmail.com wrote: so my question is how to achieve a non-sort query method, this method can get result constantly and don't travel all unnecessary doc. Does Lucene supply some strategies to implement this? If you want the result as soon as

Re: getLocale of SortField

2013-07-09 Thread Trejkaz
On Wed, Jul 10, 2013 at 12:53 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, there is no more locale-based sorting in Lucene 4.x. It was deprecated in 3.x, so you should get a warning about deprecation already! I wasn't sure about this because we are on 3.6 and I didn't see a deprecation

Re: getLocale of SortField

2013-07-10 Thread Trejkaz
On Wed, Jul 10, 2013 at 4:20 PM, Uwe Schindler u...@thetaphi.de wrote: Hi, The fast replacement (means sorting works as fast without collating) is to index the fields used for sorting with CollationKeyAnalyzer ([snip]). The Collator you get from e.g. the locale. [snip] The better way is,

Callback for commits

2013-08-11 Thread Trejkaz
Hi all. Is there some kind of callback where we can be notified about commits? Sometimes a call to commit() doesn't actually commit anything (e.g. if there is nothing in memory at the time.) I'm not really sure what's wrong with assuming it does commit something, because it's another developer

Re: Stream Closed Exception and Lock Obtain Failed Exception while reading the file in chunks iteratively.

2013-09-02 Thread Trejkaz
On Mon, Sep 2, 2013 at 4:10 PM, Ankit Murarka ankit.mura...@rancoretech.com wrote: There's a reason why Writer is being opened everytime inside a while loop. I usually open writer in main method itself as suggested by you and pass a reference to it. However what I have observed is that if my

JapaneseAnalyser filter ordering

2014-01-15 Thread Trejkaz
The current ordering of JapaneseAnalyser's token filters is as follows: 1. JapaneseBaseFormFilter 2. JapanesePartOfSpeechStopFilter 3. CJKWidthFilter (similar to NormaliseFilter) 4. StopFilter 5. JapaneseKatakanaStemFilter 6. LowerCaseFilter Our existing support for

MultiFieldAttribute is deprecated but the replacement is not documented

2014-01-22 Thread Trejkaz
In 3.6.2, I notice MultiFieldAttribute is deprecated. So I looked to the docs to find the replacement: https://lucene.apache.org/core/3_6_2/api/contrib-queryparser/org/apache/lucene/queryParser/standard/config/MultiFieldAttribute.html ...and the Deprecated note doesn't say what we're supposed to

Re: exporting a query to String with default operator = AND ?

2014-01-25 Thread Trejkaz
On Sat, Jan 25, 2014 at 4:29 AM, Olivier Binda olivier.bi...@wanadoo.fr wrote: I would like to serialize a query into a string (A) and then to unserialize it back into a query (B) I guess that a solution is A) query.toString() B) StandardQueryParser().parse(query,) If your custom query

Re: limitation on token-length for KeywordAnalyzer?

2014-01-27 Thread Trejkaz
On Mon, Jan 27, 2014 at 3:48 AM, Andreas Brandl m...@3.141592654.de wrote: Is there some limitation on the length of fields? How do I get around this? [cut] My overall goal is to index (arbitrary sized) text files and run a regular expression search using lucene's RegexpQuery. I suspect the

Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-03 Thread Trejkaz
Hi all. I'm trying to find a precise and reasonably efficient way to highlight all occurrences of terms in the query, only highlighting fields which match the corresponding fields used in the query. This seems like it would be a fairly common requirement in applications. We have an existing

Re: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-05 Thread Trejkaz
On Wed, Feb 5, 2014 at 4:16 AM, Earl Hood e...@earlhood.com wrote: Our current solution is to do highlighting on the client-side. When search happens, the search results from the server includes the parsed query terms so the client has an idea of which terms to highlight vs trying to

Re: Limiting the fields a user can query on

2014-02-20 Thread Trejkaz
On Thu, Feb 20, 2014 at 1:43 PM, Jamie Johnson jej2...@gmail.com wrote: Is there a way to limit the fields a user can query by when using the standard query parser or a way to get all fields/terms that make up a query without writing custom code for each query subclass? If you mean

Re: encoding problem when retrieving document field value

2014-03-03 Thread Trejkaz
On Tue, Mar 4, 2014 at 4:44 AM, Jack Krupansky j...@basetechnology.com wrote: What is the hex value for that second character returned that appears to display as an apostrophe? Hex 92 (decimal 146) is listed as Private Use 2, so who knows what it might display as. Well, if they're dealing

Re: searching with stemming

2014-06-09 Thread Trejkaz
On Mon, Jun 9, 2014 at 7:57 PM, Jamie ja...@mailarchiva.com wrote: Greetings Our app currently uses language specific analysers (e.g. EnglishAnalyzer, GermanAnalyzer, etc.). We need an option to disable stemming. What's the recommended way to do this? These analyzers do not include an option

Reading a v2 index in v4

2014-06-09 Thread Trejkaz
Hi all. The inability to read people's existing indexes is essentially the only thing stopping us upgrading to v4, so we're stuck indefinitely on v3.6 until we find a way around this issue. As I understand it, Lucene 4 added the notion of codecs which can precisely choose how to read and write

Re: Reading a v2 index in v4

2014-06-09 Thread Trejkaz
On Mon, Jun 9, 2014 at 10:17 PM, Adrien Grand jpou...@gmail.com wrote: Hi, It is not possible to read 2.x indices from Lucene 4, even with a custom codec. For instance, Lucene 4 needs to hook into SegmentInfos.read to detect old 3.x indices and force the use of the Lucene3x codec since these

Is it possible to rewrite a MultiPhraseQuery to a SpanQuery?

2014-08-18 Thread Trejkaz
Someone asked if it was possible to do a SpanNearQuery between a TermQuery and a MultiPhraseQuery. Sadly, you can only use SpanNearQuery with other instances of SpanQuery, so we have a gigantic method where we rewrite as many queries as possible to SpanQuery. For instance, TermQuery can trivially

Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-18 Thread Trejkaz
Unrelated to my previous mail to the list, but related to the same investigation... The following test program just indexes a phrase of nonsense words using and then queries for one of the words using the same analyser. The same analyser is being used both for indexing and for querying, yet in

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-18 Thread Trejkaz
Also in case it makes a difference, we're using Lucene v3.6.2. TX - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
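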

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Trejkaz
On Tue, Aug 19, 2014 at 5:27 PM, Uwe Schindler u...@thetaphi.de wrote: Hi, You forgot to close (or commit) IndexWriter before opening the reader. Huh? The code I posted is closing it: try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_36,

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Trejkaz
Lucene 4.9 gives much the same result. import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.ja.JapaneseAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.TextField;

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-25 Thread Trejkaz
It seems like nobody knows the answer, so I'm just going to file a bug. TX

Re: ArrayIndexOutOfBoundsException: -65536

2014-10-13 Thread Trejkaz
Bit of thread necromancy here, but I figured it was relevant because we get exactly the same error. On Thu, Jan 19, 2012 at 12:47 AM, Michael McCandless luc...@mikemccandless.com wrote: Hmm, are you certain your RAM buffer is 3 MB? Is it possible you are indexing an absurdly enormous

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread Trejkaz
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson erickerick...@gmail.com wrote: Well 2 seriously consider the utility of indexing a 100+M file. Assuming it's mostly text, lots and lots and lots of queries will match it, and it'll score pretty low due to length normalization. And you probably

Re: Lucene query behavior using NOT

2015-02-08 Thread Trejkaz
On Sun, Feb 8, 2015 at 9:04 PM, Uwe Schindler u...@thetaphi.de wrote: Hi, Lucene does not use algebraic / boolean logic! Maybe review this blog post: https://lucidworks.com/blog/why-not-and-or-and-not/ This article is an old classic. The plus, minus, nothing operators aren't without their
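The article's core point is that MUST_NOT only prunes; a BooleanQuery containing nothing but MUST_NOT clauses matches no documents at all. The standard workaround is to anchor the negation on a MatchAllDocsQuery. A sketch using the mutable Lucene 4.x BooleanQuery (5.x and later use BooleanQuery.Builder instead):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TermQuery;

public class PureNegation {
    // "everything except <term>": MUST_NOT alone selects nothing, so a
    // MatchAllDocsQuery supplies the set of documents to subtract from.
    public static BooleanQuery everythingExcept(Term term) {
        BooleanQuery query = new BooleanQuery();
        query.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
        query.add(new TermQuery(term), BooleanClause.Occur.MUST_NOT);
        return query;
    }
}
```

This is also what query parsers that accept a bare `NOT term` typically expand it to under the covers.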

BytesRef violates the principle of least astonishment

2015-05-19 Thread Trejkaz
Hi all. The Lucene 4 migration guide helpfully suggests to work with BytesRef directly rather than converting to string, but I disagree. Take the following example of building up a List<Term> by iterating a TermsEnum. I think it is written in a fairly straight-forward fashion. I added some println

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Trejkaz
On Wed, May 20, 2015 at 5:12 PM, András Péteri apet...@b2international.com wrote: As Olivier wrote, multiple BytesRef instances can share the underlying byte array when representing slices of existing data, for performance reasons. BytesRef#clone()'s javadoc comment says that the result will
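The sharing behaviour described in this reply is exactly why code that keeps terms around must copy eagerly; BytesRef.deepCopyOf is the defensive copy the API provides. A sketch of the iteration pattern against the Lucene 4.x API (class and method names here are ours, for illustration):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermCollector {
    // TermsEnum.next() may hand back a BytesRef whose backing array is
    // reused on the next call, so take a private copy before keeping it.
    public static List<Term> collect(String field, TermsEnum termsEnum) throws IOException {
        List<Term> terms = new ArrayList<>();
        for (BytesRef bytes = termsEnum.next(); bytes != null; bytes = termsEnum.next()) {
            terms.add(new Term(field, BytesRef.deepCopyOf(bytes)));
        }
        return terms;
    }
}
```

deepCopyOf allocates a fresh backing array, so the collected terms stay stable after the enum moves on.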

Re: BytesRef violates the principle of least astonishment

2015-05-21 Thread Trejkaz
On Thu, May 21, 2015 at 9:44 AM, Chris Hostetter hossman_luc...@fucit.org wrote: If you really feel strongly about this, and want to advocate for more consistency arround the meaning/implementation of clone() in Java APIs, i suggest you take it up with the Open JDK project, and focus on a more

Getting a proper ID value into every document

2015-06-04 Thread Trejkaz
Hi all. We had been going for the longest time abusing Lucene's doc IDs as our own IDs and of course all our filters still work like this. But at the moment, we're looking at ways to break our dependencies on this. One of the motivators for this is the outright removal of FieldCache in Lucene 5.
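The usual way off Lucene's ephemeral doc IDs is to attach an application-owned identifier to every document: indexed so it can be looked up with a TermQuery, stored so it can be read back, and mirrored as a doc value for sorting or joining without FieldCache. A sketch against the Lucene 4.x/5.x field API (field names here are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;

public class StableIds {
    // StringField indexes the value as a single un-analysed token and stores
    // it; the doc-values twin makes it usable for sorting and per-doc lookup.
    public static Document withId(Document doc, long id) {
        doc.add(new StringField("id", Long.toString(id), Field.Store.YES));
        doc.add(new NumericDocValuesField("id_dv", id));
        return doc;
    }
}
```

Unlike a doc ID, this value survives merges, deletes, and re-opening the index.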

Specifying a Version vs. not specifying a Version

2015-05-28 Thread Trejkaz
Hi all. I know with older Lucene there was a recommendation never to use Version.CURRENT because it would break backwards compatibility. So we changed all our code over to call, for instance, new StandardTokenizer(Version.LUCENE_36, createReader()). Now StandardTokenizer(Version, Reader) is

Re: Specifying a Version vs. not specifying a Version

2015-05-31 Thread Trejkaz
On Sat, May 30, 2015 at 9:33 AM, Chris Hostetter hossman_luc...@fucit.org wrote: My best understanding based on what I see in the current code, is that if you care about backcompat: * you must call setVersion() on any *Analyzer* instances you construct before using them * you must *not*
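In concrete terms, the setVersion() rule quoted above looks like this against the Lucene 5.x API; the call has to land before the analyzer builds its first token stream, since cached per-thread components are not rebuilt afterwards:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class VersionedAnalyzer {
    // An analyzer pinned to 4.10.x back-compat behaviour for reading
    // indexes built by an older release.
    public static StandardAnalyzer legacyAnalyzer() {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        analyzer.setVersion(Version.LUCENE_4_10_4);
        return analyzer;
    }
}
```

Calling setVersion() after the analyzer has already produced a token stream on a thread is silently ineffective for that thread, which is exactly the kind of trap the thread is complaining about.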
