[Lucene.Net] [jira] [Commented] (LUCENENET-426) Mark BaseFragmentsBuilder methods as virtual
[ https://issues.apache.org/jira/browse/LUCENENET-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053692#comment-13053692 ] Itamar Syn-Hershko commented on LUCENENET-426: -- I think we already went through this? Anyway, I don't see how that's relevant. My point here is . As for my particular use case - I'm trying to highlight based on termvectors of field A using the stored content from field B. I don't think I'm doing anything wrong here. Only spent 10 minutes on that tho. Mark BaseFragmentsBuilder methods as virtual Key: LUCENENET-426 URL: https://issues.apache.org/jira/browse/LUCENENET-426 Project: Lucene.Net Issue Type: Improvement Components: Lucene.Net Contrib Affects Versions: Lucene.Net 2.9.2, Lucene.Net 2.9.4, Lucene.Net 3.x, Lucene.Net 2.9.4g Reporter: Itamar Syn-Hershko Priority: Minor Fix For: Lucene.Net 2.9.4, Lucene.Net 2.9.4g Attachments: fvh.patch Without marking methods in BaseFragmentsBuilder as virtual, it is meaningless to have FragmentsBuilder deriving from a class named Base, since most of its functionality cannot be overridden. Attached is a patch for marking the important methods virtual. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[Lucene.Net] [jira] [Issue Comment Edited] (LUCENENET-426) Mark BaseFragmentsBuilder methods as virtual
[ https://issues.apache.org/jira/browse/LUCENENET-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053692#comment-13053692 ] Itamar Syn-Hershko edited comment on LUCENENET-426 at 6/23/11 8:02 AM: --- I think we already went through this? Anyway, I don't see how that's relevant. My point here is language difference you didn't take into account, and that point is valid. As for my particular use case - I'm trying to highlight based on termvectors of field A using the stored content from field B. I don't think I'm doing anything wrong here. Only spent 10 minutes on that tho. was (Author: itamar): I think we already went through this? Anyway, I don't see how that's relevant. My point here is . As for my particular use case - I'm trying to highlight based on termvectors of field A using the stored content from field B. I don't think I'm doing anything wrong here. Only spent 10 minutes on that tho.
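The language difference the thread hinges on is that Java instance methods are virtual by default, while C# methods must be marked `virtual` before a derived class can override them. A minimal, illustrative Java sketch (class and method names here are hypothetical, not the actual Lucene.Net API):

```java
// Illustrative only: in Java, a subclass overrides a base method with no
// extra keyword, so dynamic dispatch "just works". In C#, the port's
// BaseFragmentsBuilder methods stay non-overridable until marked `virtual`,
// which is what the attached fvh.patch changes.
class BaseBuilder {
    String createFragment() {
        return "base fragment";
    }
}

class CustomBuilder extends BaseBuilder {
    @Override
    String createFragment() {
        return "custom fragment";
    }
}

public class VirtualDispatchDemo {
    public static void main(String[] args) {
        BaseBuilder b = new CustomBuilder();
        // Dynamic dispatch selects the subclass implementation.
        System.out.println(b.createFragment()); // prints "custom fragment"
    }
}
```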
[Lucene.Net] [jira] [Commented] (LUCENENET-426) Mark BaseFragmentsBuilder methods as virtual
[ https://issues.apache.org/jira/browse/LUCENENET-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053722#comment-13053722 ] Digy commented on LUCENENET-426: 10 min. work done. DIGY
[JENKINS] Solr-3.x - Build # 388 - Failure
Build: https://builds.apache.org/job/Solr-3.x/388/ No tests ran. Build Log (for compile errors): [...truncated 15801 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053675#comment-13053675 ] Simon Willnauer commented on LUCENE-3225: - Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone, plus it's cluttering the seek method, though one boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which only supports exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy? Optimize TermsEnum.seek when caller doesn't need next term -- Key: LUCENE-3225 URL: https://issues.apache.org/jira/browse/LUCENE-3225 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3225.patch Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term. We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment. I had done this change as part of LUCENE-3030, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today. The change adds a boolean onlyExact param to seek(BytesRef). -- This message is automatically generated by JIRA.
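The trade-off being discussed can be sketched outside Lucene. This self-contained analogue (illustrative only, not the real TermsEnum API; the `onlyExact` flag mirrors the boolean param the patch adds to seek(BytesRef)) shows how an exact-only seek can skip the extra work of materializing the ceiling term on a miss:

```java
import java.util.Arrays;

// Sketch: an exact-only seek returns as soon as it knows the term is absent,
// while a full seek must also position on the ceiling (next-largest) term.
public class ExactSeekDemo {
    enum SeekStatus { FOUND, NOT_FOUND, END }

    static String ceilingTerm; // set only when a non-exact seek misses

    static SeekStatus seek(String[] sortedTerms, String target, boolean onlyExact) {
        int idx = Arrays.binarySearch(sortedTerms, target);
        if (idx >= 0) {
            return SeekStatus.FOUND;
        }
        if (onlyExact) {
            // Caller doesn't need the following term: skip computing it.
            return SeekStatus.NOT_FOUND;
        }
        // binarySearch returns (-(insertion point) - 1) on a miss.
        int insertion = -idx - 1;
        if (insertion == sortedTerms.length) {
            return SeekStatus.END;
        }
        ceilingTerm = sortedTerms[insertion];
        return SeekStatus.NOT_FOUND;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "lucene", "solr"};
        System.out.println(seek(terms, "lucene", true));  // FOUND
        System.out.println(seek(terms, "mango", true));   // NOT_FOUND, no ceiling work
        System.out.println(seek(terms, "mango", false));  // NOT_FOUND
        System.out.println(ceilingTerm);                  // solr
    }
}
```

This is the pattern IndexWriter's delete-by-Term and TermQuery benefit from: on a miss they never look at the ceiling term.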
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053678#comment-13053678 ] Simon Willnauer commented on SOLR-2193: --- bq. Curious; why is the resolution status invalid? Well, we decided to cut this into two new issues and close this one. See: bq. further developments in SOLR-2565 and SOLR-2566 There have been discussions about the focus here, so we made it more clear. Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Robert Muir Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index.
[jira] [Commented] (LUCENE-3229) Overlaped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053780#comment-13053780 ] Paul Elschot commented on LUCENE-3229: -- To reduce surprises like this one when nested spans are used, the ordered case might be changed to require no overlap at all. To do that one could compare the end of one span with the beginning of the next one. AFAIK none of the existing test cases uses a nested span query, so some more test cases for that would be good to have. The docSpansOrdered method in NearSpansUnordered from the SpanOverLap2.diff patch is the same as the existing docSpansOrdered method in NearSpansOrdered. That is probably not intended. Could you provide patches as described here: http://wiki.apache.org/lucene-java/HowToContribute ? Overlaped SpanNearQuery --- Key: LUCENE-3229 URL: https://issues.apache.org/jira/browse/LUCENE-3229 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.1 Environment: Windows XP, Java 1.6 Reporter: ludovic Boutros Priority: Minor Attachments: SpanOverlap.diff, SpanOverlap2.diff, SpanOverlapTestUnit.diff While using Span queries I think I've found a little bug. With a document like this (from the TestNearSpansOrdered unit test) : w1 w2 w3 w4 w5 If I try to search for this span query : spanNear([spanNear([field:w3, field:w5], 1, true), field:w4], 0, true) the above document is returned and I think it should not, because 'w4' is not after 'w5'. The 2 spans are not ordered, because there is an overlap. I will add a test patch in the TestNearSpansOrdered unit test. I will add a patch to solve this issue too. Basically it modifies the two docSpansOrdered functions to make sure that the spans do not overlap. -- This message is automatically generated by JIRA.
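Paul's suggested stricter rule, comparing the end of one span with the beginning of the next, can be sketched in isolation (illustrative only; this is not the actual NearSpansOrdered code, whose existing check compares span starts and can therefore admit overlapping spans):

```java
// Sketch of the no-overlap ordering check: two spans count as ordered only
// when the first ends at or before the start of the second.
public class SpanOrderDemo {
    // Positions are token offsets; end is exclusive, as in Lucene spans.
    static boolean orderedNoOverlap(int start1, int end1, int start2, int end2) {
        return end1 <= start2;
    }

    public static void main(String[] args) {
        // Document "w1 w2 w3 w4 w5": the inner span w3..w5 covers positions
        // 2..5 and the term w4 covers 3..4. They overlap, so not ordered:
        System.out.println(orderedNoOverlap(2, 5, 3, 4)); // false
        // w1 (0..1) followed by w3 (2..3) is genuinely ordered:
        System.out.println(orderedNoOverlap(0, 1, 2, 3)); // true
    }
}
```

Under this rule the query spanNear([spanNear([w3, w5], 1, true), w4], 0, true) would no longer match "w1 w2 w3 w4 w5".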
[jira] [Commented] (LUCENE-3226) rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex
[ https://issues.apache.org/jira/browse/LUCENE-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053784#comment-13053784 ] Michael McCandless commented on LUCENE-3226: Can we have the constant name be descriptive (reflect what actually changed) and then add a comment expressing the version when that change was made to Lucene? I think we should name them primarily for the benefit of developers working with the source code going forward, and only secondarily for the future developer who will some day need to remove them... rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex Key: LUCENE-3226 URL: https://issues.apache.org/jira/browse/LUCENE-3226 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.1, 3.2 Reporter: Hoss Man Fix For: 3.3, 4.0 Attachments: LUCENE-3226.patch A 3.2 user recently asked if something was wrong because CheckIndex was reporting his (newly built) index version as... {noformat} Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1] {noformat} It seems like there are two very confusing pieces of information here... 1) the variable name of SegmentInfos.FORMAT_3_1 seems like a poor choice. All other FORMAT_* constants in SegmentInfos are descriptive of the actual change made, and not specific to the version when they were introduced. 2) whatever the name of the FORMAT_* variable, CheckIndex is labeling it Lucene 3.1, which is misleading since that format is always used in 3.2 (and probably 3.3, etc...). I suggest: a) rename FORMAT_3_1 to something like FORMAT_SEGMENT_RECORDS_VERSION b) change CheckIndex so that the label for the newest format always ends with "and later" (ie: Lucene 3.1 and later) so when we release versions w/o a format change we don't have to remember to manually list them in CheckIndex.
when we *do* make format changes and update CheckIndex, "and later" can be replaced with "to X.Y" and the new format can be added -- This message is automatically generated by JIRA.
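Hoss's suggestion (b) amounts to a small labeling rule. A hedged sketch with hypothetical constants (illustrative only, not the real CheckIndex code):

```java
// Sketch: the newest known format is labeled "... and later" so releases
// that don't change the format need no CheckIndex edit. The constant value
// here is made up for illustration.
public class FormatLabelDemo {
    static final int FORMAT_SEGMENT_RECORDS_VERSION = -10; // hypothetical newest
    static final int NEWEST_FORMAT = FORMAT_SEGMENT_RECORDS_VERSION;

    static String label(int format) {
        if (format == NEWEST_FORMAT) {
            // Open-ended label survives releases with no format change.
            return "Lucene 3.1 and later";
        }
        return "older format " + format;
    }

    public static void main(String[] args) {
        System.out.println(label(FORMAT_SEGMENT_RECORDS_VERSION)); // Lucene 3.1 and later
    }
}
```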
[jira] [Updated] (SOLR-2305) DataImportScheduler - Marko Bonaci
[ https://issues.apache.org/jira/browse/SOLR-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marko Bonaci updated SOLR-2305: --- Attachment: patch.txt Patch for adding DIHScheduler v1.2 to Solr DataImportScheduler - Marko Bonaci --- Key: SOLR-2305 URL: https://issues.apache.org/jira/browse/SOLR-2305 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Bill Bell Fix For: 4.0 Attachments: patch.txt Marko Bonaci has updated the WIKI page to add the DataImportScheduler, but I cannot find a JIRA ticket for it? http://wiki.apache.org/solr/DataImportHandler Do we have a ticket so the code can be tracked?
[jira] [Issue Comment Edited] (SOLR-2305) DataImportScheduler - Marko Bonaci
[ https://issues.apache.org/jira/browse/SOLR-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053796#comment-13053796 ] Marko Bonaci edited comment on SOLR-2305 at 6/23/11 11:17 AM: -- This is patch for adding DIHScheduler v1.2 to Solr. I didn't know I could make a patch concerning only org.apache.solr.handler.dataimport package :( So finally, here it is. Since I still have problems with build path/packages in Eclipse: Wasn't tested at all. No unit tests. Whoever will be adding this please feel free to contact me if such a need arises. Also, all criticism is more than welcome, I want to learn to do this the right way. Thanks was (Author: mbonaci): Patch for adding DIHScheduler v1.2 to Solr
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053800#comment-13053800 ] Noble Paul commented on SOLR-2382: -- The patch applies well. Suggestions: The SolrWriter/DIHPropertiesWriter abstraction can be a separate patch and I can commit it right away. It may also have the changes for passing the handler name. The DIHCache should take the Context as a param and the EntityProcessor does not need to make a copy of the attributes. DIH Cache Improvements -- Key: SOLR-2382 URL: https://issues.apache.org/jira/browse/SOLR-2382 Project: Solr Issue Type: New Feature Components: contrib - DataImportHandler Reporter: James Dyer Priority: Minor Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch Functionality: 1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application. 2. Provide a means to temporarily cache a child Entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor). 3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an Entity input. Also provide the ability to do delta updates on such persistent caches. 4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr Shards, or to the same Core in parallel. Use Cases: 1. We needed a flexible scalable way to temporarily cache child-entity data prior to joining to parent entities. - Using SqlEntityProcessor with Child Entities can cause an n+1 select problem. - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching mechanism and does not scale. - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process. 3. We wanted the ability to do a delta import of only the entities that changed. - Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed. - Our data comes from 50+ complex sql queries and/or flat files. - We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed. - Persistent DIH caches solve this problem. 4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter). 5. In the future, we may need to use Shards, creating a need to easily partition our source data into Shards. Implementation Details: 1. De-couple EntityProcessorBase from caching. - Created a new interface, DIHCache two implementations: - SortedMapBackedCache - An in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated). - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar - NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar. I believe this may be incompatible due to Generic Usage. - NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 2. Allow Entity Processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase DIHCacheProperties). 3. Partially De-couple SolrWriter from DocBuilder - Created a new interface DIHWriter, two implementations: - SolrWriter (refactored) - DIHCacheWriter (allows DIH to write ultimately to a Cache). 4. Create a new Entity Processor, DIHCacheProcessor, which reads a persistent Cache as DIH Entity Input. 5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data. 6. 
Change the semantics of entity.destroy() - Previously, it was being called on each iteration of DocBuilder.buildDocument(). - Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed. - The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change. General Notes: We are near completion in converting our search functionality from a legacy search engine to Solr. However, I found that DIH did not support caching to the level of our prior product's data import utility. In order to
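The pluggable-cache design described in the implementation details can be sketched roughly as follows. The interface and class names follow the patch description, but the signatures are assumptions for illustration, not the actual SOLR-2382 API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: entity rows are cached by key so a child entity's data can be
// joined to parents without re-querying the source (avoiding the n+1
// select problem described above). Signatures are illustrative.
interface DIHCache {
    void add(Object key, Map<String, Object> row);
    List<Map<String, Object>> lookup(Object key);
}

// In-memory analogue of the SortedMapBackedCache default implementation.
class SortedMapBackedCache implements DIHCache {
    private final SortedMap<Object, List<Map<String, Object>>> data = new TreeMap<>();

    public void add(Object key, Map<String, Object> row) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    public List<Map<String, Object>> lookup(Object key) {
        return data.getOrDefault(key, List.of());
    }
}

public class DihCacheDemo {
    public static void main(String[] args) {
        DIHCache cache = new SortedMapBackedCache();
        cache.add("parentId-1", Map.of("field", "value"));
        System.out.println(cache.lookup("parentId-1").size()); // 1
        System.out.println(cache.lookup("missing").isEmpty()); // true
    }
}
```

A disk-backed implementation behind the same interface is what lets caches persist across DIH runs for delta imports.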
[JENKINS-MAVEN] Lucene-Solr-Maven-3.x #160: POMs out of sync
Build: https://builds.apache.org/job/Lucene-Solr-Maven-3.x/160/ No tests ran. Build Log (for compile errors): [...truncated 8855 lines...]
[jira] [Updated] (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ https://issues.apache.org/jira/browse/LUCENE-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-695: -- Comment: was deleted (was: Read here http://customized-dog-collars.com) Improve BufferedIndexInput.readBytes() performance -- Key: LUCENE-695 URL: https://issues.apache.org/jira/browse/LUCENE-695 Project: Lucene - Java Issue Type: Improvement Components: core/store Affects Versions: 2.0.0 Reporter: Nadav Har'El Priority: Minor Attachments: readbytes.patch, readbytes.patch During a profiling session, I discovered that BufferedIndexInput.readBytes(), the function which reads a bunch of bytes from an index, is very inefficient in many cases. It is efficient for one or two bytes, and also efficient for a very large number of bytes (e.g., when the norms are read all at once); But for anything in between (e.g., 100 bytes), it is a performance disaster. It can easily be improved, though, and below I include a patch to do that. The basic problem in the existing code was that if you ask it to read 100 bytes, readBytes() simply calls readByte() 100 times in a loop, which means we check byte after byte if the buffer has another character, instead of just checking once how many bytes we have left, and copy them all at once. My version, attached below, copies these 100 bytes if they are available at bulk (using System.arraycopy), and if less than 100 are available, whatever is available gets copied, and then the rest. (as before, when a very large number of bytes is requested, it is read directly into the final buffer). In my profiling, this fix caused amazing performance improvement: previously, BufferedIndexInput.readBytes() took as much as 25% of the run time, and after the fix, this was down to 1% of the run time! However, my scenario is *not* the typical Lucene code, but rather a version of Lucene with added payloads, and these payloads average at 100 bytes, where the original readBytes() did worst. 
I expect that my fix will have less of an impact on vanilla Lucene, but it still can have an impact because it is used for things like reading fields. (I am not aware of a standard Lucene benchmark, so I can't provide benchmarks on a more typical case). In addition to the change to readBytes(), my attached patch also adds a new unit test to BufferedIndexInput (which previously did not have a unit test). This test simulates a file which contains a predictable series of bytes, and then tries to read from it with readByte() and readBytes() with various sizes (many thousands of combinations are tried) and see that exactly the expected bytes are read. This test is independent of my new readBytes() implementation, and can be used to check the old implementation as well. By the way, it's interesting that BufferedIndexOutput.writeBytes was already efficient, and wasn't simply a loop of writeByte(). Only the reading code was inefficient. I wonder why this happened.
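The fix Nadav describes can be illustrated with a standalone analogue of the buffered read loop (field and method names are illustrative, not BufferedIndexInput's real internals): bulk-copy whatever the internal buffer holds via System.arraycopy instead of calling readByte() once per requested byte.

```java
// Sketch of the bulk-copy read: each pass copies as many bytes as the
// buffer can satisfy in one System.arraycopy, refilling only when dry.
public class BulkReadDemo {
    private final byte[] file;                  // stands in for the index file
    private final byte[] buffer = new byte[8];  // tiny buffer to force refills
    private int bufferStart = 0;                // file position of buffer[0]
    private int bufferPos = 0;                  // next unread byte in buffer
    private int bufferLen = 0;                  // valid bytes in buffer

    BulkReadDemo(byte[] file) { this.file = file; }

    private void refill() {
        bufferStart += bufferLen;
        bufferLen = Math.min(buffer.length, file.length - bufferStart);
        System.arraycopy(file, bufferStart, buffer, 0, bufferLen);
        bufferPos = 0;
    }

    void readBytes(byte[] dst, int offset, int len) {
        while (len > 0) {
            int available = bufferLen - bufferPos;
            if (available == 0) {
                refill();
                continue;
            }
            // Copy everything the buffer can satisfy at once, instead of a
            // byte-at-a-time readByte() call per byte.
            int chunk = Math.min(available, len);
            System.arraycopy(buffer, bufferPos, dst, offset, chunk);
            bufferPos += chunk;
            offset += chunk;
            len -= chunk;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        BulkReadDemo in = new BulkReadDemo(data);
        byte[] dst = new byte[20];
        in.readBytes(dst, 0, 20); // spans several buffer refills
        System.out.println(dst[0] + " " + dst[19]); // 0 19
    }
}
```

With mid-sized reads (e.g. 100 bytes against an 8 KB buffer), the per-byte bounds check in the old loop is exactly the overhead this removes.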
[jira] [Commented] (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ https://issues.apache.org/jira/browse/LUCENE-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053807#comment-13053807 ] Uparis Abeysena commented on LUCENE-695: Click: http://customized-dog-collars.com
[jira] [Commented] (LUCENE-3226) rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex
[ https://issues.apache.org/jira/browse/LUCENE-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053814#comment-13053814 ] Michael McCandless commented on LUCENE-3226: bq. What is the benefit of naming the constant according to what has changed? Because then devs trying to work w/ the code have some sense of what the change was? EG for debugging maybe it's helpful, eg if something has gone wrong, later, in how SegmentInfos is handling that version or what not. bq. And what if two changes occur in the same release? Well, we can handle that case by case? I agree it's messy... maybe pick a name describing/subsuming both? Or favor one name (maybe the bigger change) and use comments to explain the other change? But if there is a comment/comments above the constant containing this same information that's just as good... bq. These constants, IMO, are used only to detect code that is needed to support a certain version, and nothing more. Right, but for the devs that need to revisit such code, it's helpful to know what real change occurred within that version... else, during debugging they'd have to eg go do some svn archaeology to understand the change. bq. And since the purpose of LUCENE-2921 is to move all index format tracking to be at the 'code'-level and not 'feature'-level, I'd assume the constants would be named accordingly. True... so maybe we take this up under that issue? I would be OK with just having comments that describe what changed in each version... So for this issue maybe re-commit just the CheckIndex fix, and leave the constant naming fixes to LUCENE-2921? 
[jira] [Updated] (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ https://issues.apache.org/jira/browse/LUCENE-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-695: -- Comment: was deleted (was: Click: http://customized-dog-collars.com)
I expect that my fix will have less of an impact on vanilla Lucene, but it still can have an impact because it is used for things like reading fields. (I am not aware of a standard Lucene benchmark, so I can't provide benchmarks on a more typical case.) In addition to the change to readBytes(), my attached patch also adds a new unit test to BufferedIndexInput (which previously did not have a unit test). This test simulates a file which contains a predictable series of bytes, and then tries to read from it with readByte() and readBytes() with various sizes (many thousands of combinations are tried) and checks that exactly the expected bytes are read. This test is independent of my new readBytes() implementation, and can be used to check the old implementation as well. By the way, it's interesting that BufferedIndexOutput.writeBytes was already efficient, and wasn't simply a loop of writeByte(). Only the reading code was inefficient. I wonder why this happened. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
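The bulk-copy idea described above can be sketched as follows. This is an illustrative reconstruction, not the attached patch: the class and field names (buffer, bufferPosition, bufferLength, refill) are assumptions standing in for BufferedIndexInput's internals.

```java
// A sketch of the bulk-copy idea described above (not the actual patch):
// instead of one readByte() call per requested byte, copy whatever the
// buffer currently holds with a single System.arraycopy, refill, and
// repeat until the request is satisfied.
abstract class SimpleBufferedInput {
    protected byte[] buffer = new byte[1024];
    protected int bufferPosition; // next byte to read from the buffer
    protected int bufferLength;   // number of valid bytes in the buffer

    /** Refill the buffer from the underlying file; resets bufferPosition. */
    protected abstract void refill();

    /** Read len bytes into b[offset..], bulk-copying from the buffer. */
    public void readBytes(byte[] b, int offset, int len) {
        while (len > 0) {
            int available = bufferLength - bufferPosition;
            if (available == 0) {   // buffer exhausted: refill once,
                refill();           // then recompute availability
                available = bufferLength - bufferPosition;
            }
            int chunk = Math.min(available, len);
            System.arraycopy(buffer, bufferPosition, b, offset, chunk);
            bufferPosition += chunk;
            offset += chunk;
            len -= chunk;
        }
    }
}
```

The unit test described in the comment follows the same shape: feed a predictable byte sequence through refill() and check that reads of arbitrary sizes return exactly the expected bytes.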
[jira] [Commented] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053817#comment-13053817 ] ludovic Boutros commented on LUCENE-3229: - :To reduce surprises like this one when nested spans are used, the ordered case might be changed to require no overlap at all. :To do that one could compare the end of one span with the beginning of the next one. :AFAIK none of the existing test cases uses a nested span query, so some more test cases for that would be good to have. The patch does exactly that. :The docSpansOrdered method in NearSpansUnordered from the SpanOverLap2.diff patch :is the same as the existing docSpansOrdered method in NearSpansOrdered. :That is probably not intended. It is the same as the actual method because I don't want to modify the current behavior of the NearSpansUnordered class. Overlap should be allowed for unordered near span queries. And if I do not do that, the unit tests fail for unordered near span queries. :Could you provide patches as described here: http://wiki.apache.org/lucene-java/HowToContribute ? Sorry for that, sure, I will provide the patch shortly. Overlapped SpanNearQuery --- Key: LUCENE-3229 URL: https://issues.apache.org/jira/browse/LUCENE-3229 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.1 Environment: Windows XP, Java 1.6 Reporter: ludovic Boutros Priority: Minor Attachments: SpanOverlap.diff, SpanOverlap2.diff, SpanOverlapTestUnit.diff While using Span queries I think I've found a little bug. With a document like this (from the TestNearSpansOrdered unit test): w1 w2 w3 w4 w5 If I try to search for this span query: spanNear([spanNear([field:w3, field:w5], 1, true), field:w4], 0, true) the above document is returned and I think it should not, because 'w4' is not after 'w5'. The 2 spans are not ordered, because there is an overlap. I will add a test patch in the TestNearSpansOrdered unit test. 
I will add a patch to solve this issue too. Basically it modifies the two docSpansOrdered functions to make sure that the spans do not overlap. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
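The two ordering checks under discussion can be sketched like this. Names and signatures are illustrative, not the actual NearSpansOrdered/NearSpansUnordered code; positions follow Lucene's convention that a span's end is exclusive.

```java
// Illustrative sketch of the two ordering checks discussed above.
class SpanOrderSketch {
    /** Lenient check: b merely starts (or, on a tie, ends) after a does,
     *  so an inner span may still overlap the next one -- the reported bug. */
    static boolean orderedAllowingOverlap(int aStart, int aEnd, int bStart, int bEnd) {
        return aStart == bStart ? aEnd < bEnd : aStart < bStart;
    }

    /** Strict check in the spirit of the patch: the next span must begin
     *  at or after the previous span ends, so no overlap is possible. */
    static boolean orderedNoOverlap(int aEnd, int bStart) {
        return aEnd <= bStart;
    }
}
```

For the w1..w5 example: spanNear(w3, w5) covers positions 2..5 and w4 covers 3..4, so the lenient check accepts the document while the strict check rejects it.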
[jira] [Updated] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ludovic Boutros updated LUCENE-3229: Attachment: LUCENE-3229.patch Here is the patch as described in the wiki. Is it OK? Overlapped SpanNearQuery --- Key: LUCENE-3229 URL: https://issues.apache.org/jira/browse/LUCENE-3229 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.1 Environment: Windows XP, Java 1.6 Reporter: ludovic Boutros Priority: Minor Attachments: LUCENE-3229.patch, SpanOverlap.diff, SpanOverlap2.diff, SpanOverlapTestUnit.diff While using Span queries I think I've found a little bug. With a document like this (from the TestNearSpansOrdered unit test): w1 w2 w3 w4 w5 If I try to search for this span query: spanNear([spanNear([field:w3, field:w5], 1, true), field:w4], 0, true) the above document is returned and I think it should not, because 'w4' is not after 'w5'. The 2 spans are not ordered, because there is an overlap. I will add a test patch in the TestNearSpansOrdered unit test. I will add a patch to solve this issue too. Basically it modifies the two docSpansOrdered functions to make sure that the spans do not overlap. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3232) Move MutableValues to Common Module
[ https://issues.apache.org/jira/browse/LUCENE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053826#comment-13053826 ] Michael McCandless commented on LUCENE-3232: Patch looks great! I wonder if we should name this module something more specific, eg docvalues? values? Should we also move over ValueSource, *DocValues, FieldCacheSource? I think, then, Solr 3.x grouping could cutover and then group by other field types. Move MutableValues to Common Module --- Key: LUCENE-3232 URL: https://issues.apache.org/jira/browse/LUCENE-3232 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3232.patch, LUCENE-3232.patch Solr makes use of the MutableValue* series of classes to improve performance of grouping by FunctionQuery (I think). As such they are used in ValueSource implementations. Consequently we need to move these classes in order to move the ValueSources. As Yonik pointed out, these classes have use beyond just FunctionQuerys and might be used by both Solr and other modules. However I don't think they belong in Lucene core, since they aren't really related to search functionality. Therefore I think we should put them into a Common module, which can serve as a dependency to Solr and any module. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2618) Indexing and search on more than one object
Indexing and search on more than one object --- Key: SOLR-2618 URL: https://issues.apache.org/jira/browse/SOLR-2618 Project: Solr Issue Type: Improvement Components: clients - java Affects Versions: 3.2 Reporter: Monica Storfjord Priority: Minor It would be very beneficial for a project that I am currently working on to have the ability to index and search on various subclasses of an object and map the objects directly to the actual domain object. We are planning to do an implementation of this feature, but if there is a Solr plugin or something that introduces this feature already, it will reduce the development time for us greatly! We are using SolrJ against an Apache Solr 3.2 instance to index, change and search. It should be possible to make a solution that maps against a special type field (field name=classtype type=class) in schema.xml that is indexed every time and uses reflection against the actual class? - Monica -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3226) rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex
[ https://issues.apache.org/jira/browse/LUCENE-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053830#comment-13053830 ] Shai Erera commented on LUCENE-3226: bq. I would be OK with just having comments that describe what changed in each version... Yeah, that's what I thought. Constant name denotes the code version, documentation denotes the actual changes. bq. So for this issue maybe re-commit just the CheckIndex fix I think that that's what Robert and I agreed to do, and we moved on to discuss what the actual message printed should be, so it's less confusing to the users. rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex Key: LUCENE-3226 URL: https://issues.apache.org/jira/browse/LUCENE-3226 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.1, 3.2 Reporter: Hoss Man Fix For: 3.3, 4.0 Attachments: LUCENE-3226.patch A 3.2 user recently asked if something was wrong because CheckIndex was reporting his (newly built) index version as... {noformat} Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1] {noformat} It seems like there are two very confusing pieces of information here... 1) the variable name of SegmentInfos.FORMAT_3_1 seems like a poor choice. All other FORMAT_* constants in SegmentInfos are descriptive of the actual change made, and not specific to the version when they were introduced. 2) whatever the name of the FORMAT_* variable, CheckIndex is labeling it Lucene 3.1, which is misleading since that format is always used in 3.2 (and probably 3.3, etc...). I suggest: a) rename FORMAT_3_1 to something like FORMAT_SEGMENT_RECORDS_VERSION b) change CheckIndex so that the label for the newest format always ends with "and later" (ie: "Lucene 3.1 and later") so when we release versions w/o a format change we don't have to remember to manually list them in CheckIndex. 
when we *do* make format changes and update CheckIndex, "and later" can be replaced with "to X.Y" and the new format can be added -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053844#comment-13053844 ] Michael McCandless commented on LUCENE-3225: bq. Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone plus it's cluttering the seek method. One boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which only supports exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here. Well, it only means the enum is unpositioned if you get back NOT_FOUND? Ie, it's just like if you get back null from next(), or END from seek(): in these cases, the enum is unpositioned and you need to call seek again. My worry, if we force an up-front decision here (exact-only enum vs non-exact enum), is that we prevent legitimate use cases where the caller wants to mix & match with one enum. EG, when AutomatonQuery intersects w/ the terms: when it hits a region where terms are denser than what the automaton will accept (such as an infinite part), it should use non-exact seeking, but then when it's in a region where terms are less dense (eg a finite part) it should use exact seeking. I'll open a separate issue for this. The TermsEnum impls can be efficient in this case, ie re-using internal seek state for the exact and non-exact cases (MemoryCodec does this). But I agree another boolean to seek isn't great; maybe instead we can make a separate seekExact method? Default impl would just call seek (and get no perf gains). BTW, similarly, I think we have a missing API in DISI (for scoring): advance always does a next() if the target doc doesn't match. 
But we can get substantial performance gains in some cases (see LUCENE-1536) if we had an advanceExact that would not do the next and simply tell us if this doc matched or not. bq. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy? Not sure what you meant here? Optimize TermsEnum.seek when caller doesn't need next term -- Key: LUCENE-3225 URL: https://issues.apache.org/jira/browse/LUCENE-3225 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3225.patch Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term. We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment. I had done this change as part of LUCENE-3030, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today. The change adds a boolean onlyExact param to seek(BytesRef). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
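The seekExact idea floated in the comments can be sketched as follows. This is a hedged illustration, not the real Lucene TermsEnum API: the class name, String keys, and enum are stand-ins, but the shape matches the proposal, a separate method whose default implementation simply delegates to seek(), so only codecs that can save work (e.g. MemoryCodec) need to override it.

```java
// Sketch of the proposed seekExact default (illustrative names only).
abstract class SketchTermsEnum {
    enum SeekStatus { FOUND, NOT_FOUND, END }

    /** Positions the enum at term, or at the ceiling term if absent. */
    abstract SeekStatus seek(String term);

    /** Returns true only on an exact match; on false the enum may be
     *  left unpositioned, just like NOT_FOUND or END from seek().
     *  This default implementation gets no performance gain; codecs
     *  that can avoid computing the ceiling term would override it. */
    boolean seekExact(String term) {
        return seek(term) == SeekStatus.FOUND;
    }
}
```

Callers that only care about exact hits (e.g. deletes by Term, TermQuery per-segment lookups) would use seekExact; callers that need the ceiling term keep using seek.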
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053846#comment-13053846 ] Michael McCandless commented on LUCENE-3225: This patch gives nice gains for MemoryCodec: I did a quick test w/ my NRT stress test (reopen at 2X Twitter's peak indexing rate) and the reopen time dropped from ~49 msec to ~43 msec (~12% faster). This is impressive because resolving deletes is just one part of opening the NRT reader, ie we also must write the new segment, open SegmentReader against it, etc. Optimize TermsEnum.seek when caller doesn't need next term -- Key: LUCENE-3225 URL: https://issues.apache.org/jira/browse/LUCENE-3225 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3225.patch Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term. We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment. I had done this change as part of LUCENE-3030, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today. The change adds a boolean onlyExact param to seek(BytesRef). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053847#comment-13053847 ] Shai Erera commented on LUCENE-3079: I would like to contribute IBM's faceted search package (been wanting to do that for quite a while). The package supports the following features/capabilities (at a high level):
* Taxonomy index -- manages trees of 'categories'. You can view examples of taxonomies at e.g. the Open Directory Project.
** It's a Lucene index managed alongside the content index.
** Builds the taxonomy on-the-fly (i.e. as categories are discovered).
** In general it maps a category hierarchy to ordinals (integers). For example, the category /date/2011/06/24 will create the following entries in the taxonomy index:
*** /date, ordinal=1
*** /date/2011, ordinal=2
*** /date/2011/06, ordinal=3
*** /date/2011/06/24, ordinal=4
* FacetsDocumentBuilder, which receives a list of categories that are associated w/ the document (can be of several dimensions) and:
** Fetches the ordinals of the category components from the taxonomy index (adding them to it on-the-fly).
** Indexes them in a (compressed) payload for the document (so for the above category example, 4 payloads will be indexed for the document).
** FDB can be used to augment a Document with other fields for indexing (it adds its own Field objects).
* FacetsCollector receives a handle to the taxonomy and a list of facet 'roots' to count, and returns the top-K categories for each requested facet:
** The root can denote any node in the category tree (e.g., 'count all facets under /date/2011').
** Top-K can be returned for the topmost K immediate children of root, or any top-K in the sub-tree of root.
* Counting algorithm (at a high level):
** Fetch the payload for every matching document.
** Increment by 1 the count of every ordinal that was encountered (even for facets that were not requested by the user).
** After all ordinals are counted, compute the top-K on the ones the user requested.
** Label the result facets.
* Miscellaneous features:
** *Sampling* algorithm allows for more efficient facet counting/accumulation, while still returning exact counts for the top-K facets.
** *Complements* algorithm allows for more efficient facet counting/accumulation when the number of results is more than 50% of the docs in the index (we keep a total count of facets, count facets on the docs that did not match the query, and subtract).
*** Complements can be used to count facets that do not appear in any of the matching documents (of this result set). This does not exist in the package though ... yet.
** *Facets partitioning* -- if the taxonomy is huge (i.e. millions of categories), it is better to partition them at indexing time, so that search time is faster and consumes less memory. Note that this is required because of the approach of counting all (allocating a count array) and then keeping only the results of interest.
** *Category enhancements* allow storing 'metadata' with categories in the index, so that more than simple counting can be implemented:
*** *Weighted facets* (built on top of enhancements) allow associating a weight w/ each category, and using smarter counting techniques at runtime. For example, if facets are generated by an analytics component, the confidence level can be set as the category's weight. If tags are indexed as facets (for e.g. generating a tag cloud for the result set), the number of times the document was tagged by the tag can be set as the tag's weight.
** The fact that facets are indexed in the payloads of documents allows managing very large taxonomies and indexes without blowing up the RAM at runtime (but incurs some search performance overhead). However, the payloads can be loaded up into RAM (like in FieldCache), in which case runtime becomes much faster.
*** How the facets are stored is abstracted, though, by a CountingList API, so we can definitely explore other means of storing the facet ordinals. Actually, the CountingList API allows us to read the ordinals from disk or RAM w/o affecting the rest of the algorithm at all.
** I did not want to dive too deep on the API here, but the runtime API is very extensible and allows one to use FacetsCollector for the simple cases, or lower-level API to get more control over the process. You can look at: FacetRequest, FacetSearchParams, FacetResult, FacetResultNode, FacetsCollector, FacetsAccumulator, FacetsAggregator for a more extensive set of API to use.
* The package comes with example code which shows how to use the different features I've mentioned. There are also unit tests for ensuring the example code works :).
* The package comes with a very extensive test suite and is in use by many of our products for a long time, so I can state that it's very stable.
* Some rough performance numbers:
** Collection of 1M documents, few hierarchical
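The counting algorithm outlined above can be reduced to a small sketch. Plain arrays stand in for the taxonomy and for per-document payload decoding; all names here are illustrative (the real package decodes ordinals from compressed payloads and supports sampling, complements, and partitioning on top of this core).

```java
// Minimal sketch of the count-then-top-K core described above.
import java.util.Arrays;

class FacetCountSketch {
    /**
     * docOrdinals[d] = category ordinals attached to document d
     * (as decoded from its payload). Returns the k most frequent
     * ordinals across the matching documents.
     */
    static int[] topK(int[][] docOrdinals, int[] matchingDocs, int maxOrdinal, int k) {
        int[] counts = new int[maxOrdinal + 1];
        // Step 1: count every ordinal seen, even ones the user didn't request.
        for (int doc : matchingDocs)
            for (int ord : docOrdinals[doc])
                counts[ord]++;
        // Step 2: keep only the top-K ordinals by count (labeling them
        // via the taxonomy index would be the final step).
        Integer[] ords = new Integer[maxOrdinal + 1];
        for (int i = 0; i <= maxOrdinal; i++) ords[i] = i;
        Arrays.sort(ords, (a, b) -> counts[b] - counts[a]);
        int[] top = new int[Math.min(k, ords.length)];
        for (int i = 0; i < top.length; i++) top[i] = ords[i];
        return top;
    }
}
```

Note how the full count array over all ordinals is exactly why the comment says huge taxonomies need partitioning: the array is allocated even though only the requested top-K survives.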
[jira] [Commented] (LUCENE-3226) rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex
[ https://issues.apache.org/jira/browse/LUCENE-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053859#comment-13053859 ] Michael McCandless commented on LUCENE-3226: OK I agree, I think :) Who will re-commit the CheckIndex fix here...? rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex Key: LUCENE-3226 URL: https://issues.apache.org/jira/browse/LUCENE-3226 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.1, 3.2 Reporter: Hoss Man Fix For: 3.3, 4.0 Attachments: LUCENE-3226.patch A 3.2 user recently asked if something was wrong because CheckIndex was reporting his (newly built) index version as... {noformat} Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1] {noformat} It seems like there are two very confusing pieces of information here... 1) the variable name of SegmentInfos.FORMAT_3_1 seems like a poor choice. All other FORMAT_* constants in SegmentInfos are descriptive of the actual change made, and not specific to the version when they were introduced. 2) whatever the name of the FORMAT_* variable, CheckIndex is labeling it Lucene 3.1, which is misleading since that format is always used in 3.2 (and probably 3.3, etc...). I suggest: a) rename FORMAT_3_1 to something like FORMAT_SEGMENT_RECORDS_VERSION b) change CheckIndex so that the label for the newest format always ends with "and later" (ie: "Lucene 3.1 and later") so when we release versions w/o a format change we don't have to remember to manually list them in CheckIndex. when we *do* make format changes and update CheckIndex, "and later" can be replaced with "to X.Y" and the new format can be added -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3231) Add fixed size DocValues int variants & expose Arrays where possible
[ https://issues.apache.org/jira/browse/LUCENE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053865#comment-13053865 ] Michael McCandless commented on LUCENE-3231: This looks great Simon! Add fixed size DocValues int variants & expose Arrays where possible Key: LUCENE-3231 URL: https://issues.apache.org/jira/browse/LUCENE-3231 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3231.patch Currently we only have a variable bit-packed ints implementation. For flexible scoring or loading field caches it is desirable to have fixed int implementations for 8, 16, 32 and 64 bit. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3233) HuperDuperSynonymsFilter™
HuperDuperSynonymsFilter™ - Key: LUCENE-3233 URL: https://issues.apache.org/jira/browse/LUCENE-3233 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir The current synonymsfilter uses a lot of ram and cpu, especially at build time. I think yesterday I heard about huge synonyms files three times. So, I think we should use an FST-based structure, sharing the inputs and outputs. And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes() -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™
[ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3233: Attachment: LUCENE-3233.patch here's a rough start to building a datastructure that I think makes good tradeoffs between RAM and processing. No matter what, the processing on the filter-side will be hairy because of the 'interleaving' with the tokenstream. This one is just an FST<CharsRef, Int[]> (BYTE4) where Int is an ord to a BytesRefHash, containing the output Bytes for each term. This way, at input time we can walk the FST with codePointAt() On both sides, the Chars/Bytes are actually phrases, using \u as a word separator. HuperDuperSynonymsFilter™ - Key: LUCENE-3233 URL: https://issues.apache.org/jira/browse/LUCENE-3233 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir Attachments: LUCENE-3233.patch The current synonymsfilter uses a lot of ram and cpu, especially at build time. I think yesterday I heard about huge synonyms files three times. So, I think we should use an FST-based structure, sharing the inputs and outputs. And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes() -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
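The shape of that datastructure can be illustrated with ordinary collections standing in for the real pieces. Everything here is an assumption for illustration: a TreeMap plays the FST's role, a List plus a Map plays the BytesRefHash's role, and a NUL-style separator character stands in for the word separator mentioned above.

```java
// Illustrative stand-in for the structure described above: multi-word
// input phrases are joined with a separator character, and each maps to
// an ordinal into a deduplicated table of output phrases, so identical
// outputs are shared across many inputs (the FST's payload-sharing idea).
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SynonymMapSketch {
    private static final String SEP = "\u0000";                      // assumed word separator
    private final TreeMap<String, Integer> inputs = new TreeMap<>(); // phrase -> ord
    private final List<String> outputs = new ArrayList<>();          // ord -> output phrase
    private final Map<String, Integer> outputOrds = new HashMap<>(); // dedup outputs

    void add(String[] inputWords, String[] outputWords) {
        String out = String.join(SEP, outputWords);
        Integer ord = outputOrds.get(out);
        if (ord == null) {                 // share identical outputs via one ord
            ord = outputs.size();
            outputs.add(out);
            outputOrds.put(out, ord);
        }
        inputs.put(String.join(SEP, inputWords), ord);
    }

    /** Returns the output phrase for an input phrase, or null if absent. */
    String[] lookup(String... words) {
        Integer ord = inputs.get(String.join(SEP, words));
        return ord == null ? null : outputs.get(ord).split(SEP);
    }
}
```

The RAM win in the real design comes from the FST sharing prefixes/suffixes of inputs and the ord table sharing outputs; this sketch only shows the second half.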
[jira] [Commented] (SOLR-2610) Add an option to delete index through CoreAdmin UNLOAD action
[ https://issues.apache.org/jira/browse/SOLR-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053870#comment-13053870 ] Mark Miller commented on SOLR-2610: --- But you *might* want to (in fact, I do this). If you are really done with a core, if you *really* want to remove it, what do you need the config files around for anymore? Seems like a reasonable option to me - makes no sense as the default I'd agree with. nukeEverything=true ;) Add an option to delete index through CoreAdmin UNLOAD action - Key: SOLR-2610 URL: https://issues.apache.org/jira/browse/SOLR-2610 Project: Solr Issue Type: Improvement Components: multicore Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2610-branch3x.patch, SOLR-2610.patch Right now, one can unload a Solr Core but the index files are left behind and consume disk space. We should have an option to delete the index when unloading a core. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3230) Make FSDirectory.fsync() public and static
[ https://issues.apache.org/jira/browse/LUCENE-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053872#comment-13053872 ] Michael McCandless commented on LUCENE-3230: This seems OK, but my only worry is I'm not sure this way of fsync'ing really works? Ie, this code opens a r/w RAF, calls sync, closes it. It's not clear that this is guaranteed to sync file handles open in the past against the same file. This is something we separately should look into / fix, but with this uncertainty it makes me nervous exposing this as a public API... maybe we could expose it with a big warning. bq. Also, while reviewing the code, I noticed that if IOE occurs, the code sleeps for 5 msec. If an InterruptedException occurs then, it immediately throws ThreadIE, completely ignoring the fact that it slept due to IOE. Shouldn't we at least pass IOE.getMessage() on ThreadIE? +1 Make FSDirectory.fsync() public and static -- Key: LUCENE-3230 URL: https://issues.apache.org/jira/browse/LUCENE-3230 Project: Lucene - Java Issue Type: New Feature Components: core/store Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.3, 4.0 I find FSDirectory.fsync() (today protected and instance method) very useful as a utility to sync() files. I'd like create a FSDirectory.sync() utility which contains the exact same impl of FSDir.fsync(), and have the latter call it. We can have it part of IOUtils too, as it's a completely standalone utility. I would get rid of FSDir.fsync() if it wasn't protected (as if encouraging people to override it). I doubt anyone really overrides it (our core Directories don't). Also, while reviewing the code, I noticed that if IOE occurs, the code sleeps for 5 msec. If an InterruptedException occurs then, it immediately throws ThreadIE, completely ignoring the fact that it slept due to IOE. Shouldn't we at least pass IOE.getMessage() on ThreadIE? The patch is trivial, so I'd like to get some feedback before I post it. 
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
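The fsync pattern under discussion can be shown as a standalone sketch: open a read/write handle on an existing file and sync its descriptor, retrying briefly on IOException. This is illustrative, not the FSDirectory code; the retry count and sleep mirror the 5 msec back-off the comments describe, and carrying the IOException's message into the interrupt path follows the suggestion above.

```java
// Standalone sketch of the fsync utility being discussed (assumed shape).
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class FsyncSketch {
    static void fsync(File file) throws IOException {
        IOException lastIOE = null;
        for (int retry = 0; retry < 5; retry++) {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                raf.getFD().sync();   // ask the OS to flush this descriptor
                return;
            } catch (IOException ioe) {
                lastIOE = ioe;
                try {
                    Thread.sleep(5);  // brief back-off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    // don't swallow the IOE that caused the sleep:
                    throw new RuntimeException(
                        "fsync interrupted after: " + lastIOE.getMessage(), ie);
                }
            }
        }
        throw lastIOE; // all retries failed
    }
}
```

Note this also makes the comment's worry concrete: the sync applies to the freshly opened descriptor, so whether it covers writes made through earlier handles on the same file is exactly the open question.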
[jira] [Commented] (LUCENE-3232) Move MutableValues to Common Module
[ https://issues.apache.org/jira/browse/LUCENE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053876#comment-13053876 ] Simon Willnauer commented on LUCENE-3232: - bq. I wonder if we should name this module something more specific, eg docvalues? values? dude! no! :) Move MutableValues to Common Module --- Key: LUCENE-3232 URL: https://issues.apache.org/jira/browse/LUCENE-3232 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3232.patch, LUCENE-3232.patch Solr makes use of the MutableValue* series of classes to improve performance of grouping by FunctionQuery (I think). As such they are used in ValueSource implementations. Consequently we need to move these classes in order to move the ValueSources. As Yonik pointed out, these classes have use beyond just FunctionQuerys and might be used by both Solr and other modules. However I don't think they belong in Lucene core, since they aren't really related to search functionality. Therefore I think we should put them into a Common module, which can serve as a dependency to Solr and any module. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053885#comment-13053885 ] Simon Willnauer commented on LUCENE-3225: - {quote} BTW, similarly, I think we have a missing API in DISI (for scoring): advance always does a next() if the target doc doesn't match. But we can get substantial performance gains in some cases (see LUCENE-1536) if we had an advanceExact that would not do the next and simply tell us if this doc matched or not. {quote} +1!! {quote} But I agree another boolean to seek isn't great; maybe instead we can make a separate seekExact method? Default impl would just call seek (and get no perf gains). {quote} That's another option and I like that better, though. Yet the other should then be seekFloor, no? bq. not sure what you meant here? Never mind, I only looked at the top of the patch and figured that we only save the loading into the BytesRef, but there is more to it... Optimize TermsEnum.seek when caller doesn't need next term -- Key: LUCENE-3225 URL: https://issues.apache.org/jira/browse/LUCENE-3225 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3225.patch Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term. We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment. I had done this change as part of LUCENE-3030, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today. The change adds a boolean onlyExact param to seek(BytesRef). 
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053886#comment-13053886 ] Jan Høydahl commented on LUCENE-3079: - Bravo Shai IBM! Faceting module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079.patch Faceting is a hugely important feature, available in Solr today but not [easily] usable by Lucene-only apps. We should fix this by creating a shared faceting module. Ideally, we factor out Solr's faceting impl, and maybe poach/merge from other impls (e.g. Bobo Browse). Hoss describes some important challenges we'll face in doing this (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here: {noformat} To look at faceting as a concrete example, there are big reasons faceting works so well in Solr: Solr has total control over the index, knows exactly when the index has changed to rebuild caches, has a strict schema so it can make sense of field types and pick faceting algos accordingly, has a multi-phase distributed search approach to get exact counts efficiently across multiple shards, etc... (and there are still a lot of additional enhancements and improvements that can be made to take even more advantage of knowledge Solr has because it owns the index, that no one has had time to tackle) {noformat} This is a great list of the things we face in refactoring. It's also important because, if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same needs, and so we really have to address them in creating the shared module. I think we should get a basic faceting module started, but should not cut Solr over at first.
We should iterate on the module, fold in improvements, etc., and then, once we can fully verify that cutting over doesn't hurt Solr (i.e., lose functionality or performance), we can cut over later.
[JENKINS-MAVEN] Lucene-Solr-Maven-trunk #157: POMs out of sync
Build: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/157/ No tests ran. Build Log (for compile errors): [...truncated 7493 lines...]
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053940#comment-13053940 ] Ryan McKinley commented on LUCENE-3079: --- bq. Bravo Shai IBM! +1! This sounds awesome, and I hope it will prove how modules will help Lucene *and* Solr.
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053964#comment-13053964 ] Michael McCandless commented on LUCENE-3225: bq. But shouldn't the other method then be seekFloor? Ahhh, right, we had discussed this on the dev list. I agree! But we should do that in another issue. Though, I think we should rename the current seek to seekCeil; I'll do that here.
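The seekCeil/seekExact split under discussion can be sketched roughly as follows. This is a toy model over a sorted term list, not Lucene's actual TermsEnum; the method names mirror the proposal but are only illustrative:

```java
import java.util.Collections;
import java.util.List;

// Toy model of the proposed API split: seekCeil positions the enum on the
// smallest term >= target, while seekExact only reports whether the target
// exists. The default seekExact just delegates to seekCeil (no speedup);
// a codec with a cheaper exact lookup (e.g. an FST-based one) could override it.
class ToyTermsEnum {
    private final List<String> sortedTerms;
    private int position = -1;

    ToyTermsEnum(List<String> sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    /** Seeks to the smallest term >= target; returns true on an exact match. */
    boolean seekCeil(String target) {
        int idx = Collections.binarySearch(sortedTerms, target);
        if (idx >= 0) {
            position = idx;
            return true;
        }
        position = -idx - 1; // binarySearch encodes the ceiling as -(insertion point) - 1
        return false;
    }

    /** Default impl: same cost as seekCeil; subclasses may override to save work. */
    boolean seekExact(String target) {
        return seekCeil(target);
    }

    /** The term the enum is positioned on, or null if positioned past the end. */
    String term() {
        return position >= 0 && position < sortedTerms.size() ? sortedTerms.get(position) : null;
    }
}
```

A caller like IndexWriter's delete-by-Term path would call seekExact and never look at term(), which is exactly the case where an overriding codec can skip computing the ceiling term.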
[jira] [Commented] (SOLR-2610) Add an option to delete index through CoreAdmin UNLOAD action
[ https://issues.apache.org/jira/browse/SOLR-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053992#comment-13053992 ] Jason Rutherglen commented on SOLR-2610: Mark put it aptly. The problem I encountered in my own version is that leftover file handles seemed to prevent deletion of all the files; many times some of them would be left over. I also deleted the entire core directory, which is useful for manual testing (e.g., to avoid the directory-exists exception). Add an option to delete index through CoreAdmin UNLOAD action - Key: SOLR-2610 URL: https://issues.apache.org/jira/browse/SOLR-2610 Project: Solr Issue Type: Improvement Components: multicore Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2610-branch3x.patch, SOLR-2610.patch Right now, one can unload a Solr core, but the index files are left behind and consume disk space. We should have an option to delete the index when unloading a core.
[jira] [Created] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artificial test case that shows a 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.
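The phraseLimit behavior described above amounts to a bail-out counter in the candidate-phrase loop. A rough, self-contained sketch of the idea (the class and method names here are hypothetical, not FVH's internals):

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the phraseLimit idea: stop examining candidate phrase
// matches once the limit is reached, trading best-snippet accuracy for
// bounded work. A limit of -1 keeps the exhaustive (existing) behavior.
class PhraseLimitSketch {
    static List<Integer> collectCandidates(int[] matchPositions, int phraseLimit) {
        List<Integer> candidates = new ArrayList<>();
        for (int pos : matchPositions) {
            if (phraseLimit >= 0 && candidates.size() >= phraseLimit) {
                break; // bail out: accept a possibly sub-optimal snippet
            }
            candidates.add(pos); // the real highlighter would score a phrase here
        }
        return candidates;
    }
}
```

With phraseLimit=1 only the first candidate survives, which matches the "just show the first snippet" use case; larger limits mainly cap pathological blow-up.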
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Sokolov updated LUCENE-3234: - Attachment: LUCENE-3234.patch
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054007#comment-13054007 ] Robert Muir commented on LUCENE-3234: - I like this tradeoff Mike, thanks! Should we consider setting some kind of absurd default, like 10,000, to really prevent pathological cases with huge documents? We could document in CHANGES.txt that if you want the old behavior, set it to -1 or Integer.MAX_VALUE (I think we can use that here? offsets are ints?).
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054014#comment-13054014 ] Robert Muir commented on LUCENE-3234: - Yeah, you are right... but seeing as positions are ints too, I think it might be easier to use Integer.MAX_VALUE versus the -1 parameter.
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054010#comment-13054010 ] Mike Sokolov commented on LUCENE-3234: -- Yes, although a smaller number might be fine. Maybe Koji will comment: I don't completely understand the scaling here, but it seemed to me that I had a case with around 2000 occurrences of a term that led to a 15-20 sec evaluation time on my desktop. The max value will be an int, sure, although I think the number is going to scale like positions, not offsets, FWIW.
[jira] [Commented] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054026#comment-13054026 ] Paul Elschot commented on LUCENE-3229: -- Thanks for bringing this up; this has confused more people in the past, and that could well be over now. Overlapped SpanNearQuery --- Key: LUCENE-3229 URL: https://issues.apache.org/jira/browse/LUCENE-3229 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.1 Environment: Windows XP, Java 1.6 Reporter: ludovic Boutros Priority: Minor Attachments: LUCENE-3229.patch, LUCENE-3229.patch, SpanOverlap.diff, SpanOverlap2.diff, SpanOverlapTestUnit.diff While using span queries I think I've found a little bug. With a document like this (from the TestNearSpansOrdered unit test): w1 w2 w3 w4 w5 If I try to search with this span query: spanNear([spanNear([field:w3, field:w5], 1, true), field:w4], 0, true) the above document is returned, and I think it should not be, because 'w4' is not after 'w5'. The two spans are not ordered, because there is an overlap. I will add a test patch to the TestNearSpansOrdered unit test. I will add a patch to solve this issue too. Basically it modifies the two docSpansOrdered functions to make sure that the spans do not overlap.
[jira] [Updated] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Elschot updated LUCENE-3229: - Attachment: LUCENE-3229.patch Basically the same functionality as the previous patch by Ludovic Boutros. Simplified the check for non-overlapping spans, which might speed it up somewhat. Added javadoc explanations on ordered without overlap and unordered with overlap. Minor spelling and indentation changes. NearSpansOrdered might be further simplified, as not all locals are actually used now because of the simplified check, but for now I prefer to leave that to the JIT to optimize away.
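The simplified ordered-without-overlap condition can be stated compactly: adjacent spans are in order iff each span ends no later than the next one starts. A minimal stand-alone version of the check, operating on {start, end} position pairs rather than Lucene's Spans objects:

```java
// Minimal version of the ordered-without-overlap check: spans are in strict
// document order iff each span ends at or before the next span's start.
// Spans are {start, end} position pairs with an exclusive end.
class SpanOrderCheck {
    static boolean docSpansOrdered(int[][] spans) {
        for (int i = 0; i + 1 < spans.length; i++) {
            if (spans[i][1] > spans[i + 1][0]) {
                return false; // overlap, or out of order
            }
        }
        return true;
    }
}
```

In the w1..w5 example, the inner spanNear([w3, w5]) covers positions 2..5 while w4 covers 3..4, so the pair overlaps and the document is rejected under this check.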
[jira] [Commented] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054045#comment-13054045 ] ludovic Boutros commented on LUCENE-3229: - Thanks Paul, do you have any idea when this patch will be applied to branch 3.x?
[jira] [Created] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug Key: LUCENE-3235 URL: https://issues.apache.org/jira/browse/LUCENE-3235 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Not sure what's going on yet... but under Java 1.6 it seems not to hang, while under Java 1.5 it hangs fairly easily, on Linux. Java is 1.5.0_22. I suspect this is relevant: http://stackoverflow.com/questions/3292577/is-it-possible-for-concurrenthashmap-to-deadlock which refers to this JVM bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6865591 which then refers to this one http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6822370 It looks like that last bug was fixed in Java 1.6 but not 1.5.
[jira] [Commented] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
[ https://issues.apache.org/jira/browse/LUCENE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054066#comment-13054066 ] Robert Muir commented on LUCENE-3235: - +1 to drop Java 5
[jira] [Commented] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
[ https://issues.apache.org/jira/browse/LUCENE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054067#comment-13054067 ] Uwe Schindler commented on LUCENE-3235: --- LOL, no comment.
[jira] [Commented] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
[ https://issues.apache.org/jira/browse/LUCENE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054070#comment-13054070 ] Robert Muir commented on LUCENE-3235: - I ran the test with the same version as Mike (1.5.0_22) in two ways on Windows: * -Dtests.iter=100 * in a loop from a script, 100 times, each with its own ant run. I can't reproduce it on Windows. In my eyes, there isn't even an argument about whether or not we should support Java 5: it's not possible if bugs are not getting fixed.
[jira] [Commented] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
[ https://issues.apache.org/jira/browse/LUCENE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054071#comment-13054071 ] Michael McCandless commented on LUCENE-3235: Still hangs if I run -client; but it looks like -Xint prevents the hang (235 iterations so far on beast). 3.2 also hangs.
[VOTE] release 3.3
Artifacts here: http://s.apache.org/lusolr33rc0 Working release notes here: http://wiki.apache.org/lucene-java/ReleaseNote33 http://wiki.apache.org/solr/ReleaseNote33 I ran the automated release test script in trunk/dev-tools/scripts/smokeTestRelease.py, and ran 'ant test' at the top level 50 times on Windows. Here is my +1
[jira] [Commented] (SOLR-2610) Add an option to delete index through CoreAdmin UNLOAD action
[ https://issues.apache.org/jira/browse/SOLR-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054079#comment-13054079 ] Shawn Heisey commented on SOLR-2610: I can think of a corollary core action I'd like to see: the ability, on a core RELOAD, to entirely delete the index from a core and replace it with a fresh empty index that will start building at segment _0. I would do this to my build core before using it, and later, after swapping it with the live core and ensuring it's good, to free up disk space.
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054093#comment-13054093 ] Mike Sokolov commented on LUCENE-3234: -- Yes, that makes sense to me: default to 5000, say, and set explicitly to either MAX_VALUE or -1 to get the unlimited behavior (I prefer to allow -1, since otherwise you should probably treat it as an error). Do you want me to change the patch, or should I just leave that to the committer?
[jira] [Created] (LUCENE-3236) Make LowerCaseFilter and StopFilter keyword aware, similar to PorterStemFilter
Make LowerCaseFilter and StopFilter keyword aware, similar to PorterStemFilter -- Key: LUCENE-3236 URL: https://issues.apache.org/jira/browse/LUCENE-3236 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 4.0 Environment: N/A Reporter: Sujit Pal Priority: Minor Fix For: 4.0

PorterStemFilter has functionality to detect if a term has been marked as a keyword by the KeywordMarkerFilter (KeywordAttribute.isKeyword() == true), and if so, skip stemming. The suggestion is to have the same functionality in other filters where it is applicable. I think it may be particularly applicable to the LowerCaseFilter (ie if it is a keyword, don't mess with the case), and StopFilter (if it is a keyword, then don't filter it out even if it looks like a stop word). Backward compatibility is maintained (in both cases) by adding a new constructor which takes an additional boolean parameter ignoreKeyword. The current constructor will call this new constructor with ignoreKeyword = false. Patches are attached (for LowerCaseFilter and StopFilter). I have verified that the analysis JUnit tests run against the updated code, ie, backward compatibility is maintained.
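The proposed keyword-aware behavior can be illustrated with a toy sketch (this is not the Lucene TokenFilter API; the boolean flags stand in for KeywordAttribute.isKeyword() and the suggested ignoreKeyword constructor parameter):

```java
// Toy model of the proposed LowerCaseFilter change: lowercase a term only
// when it is not marked as a keyword, or when keyword-awareness is disabled.
import java.util.Locale;

class KeywordAwareLowerCase {
    static String filter(String term, boolean isKeyword, boolean ignoreKeyword) {
        if (ignoreKeyword && isKeyword) {
            return term; // keyword-marked terms pass through untouched
        }
        return term.toLowerCase(Locale.ROOT);
    }
}
```

With ignoreKeyword=false every term is lowercased as before, which is how the patch preserves backward compatibility; the StopFilter change follows the same pattern (a keyword-marked term is never dropped as a stop word).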
[jira] [Updated] (LUCENE-3236) Make LowerCaseFilter and StopFilter keyword aware, similar to PorterStemFilter
[ https://issues.apache.org/jira/browse/LUCENE-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujit Pal updated LUCENE-3236: -- Attachment: lucene-3236-patch.diff Patch generated with svn diff from the top level Lucene/Solr trunk. Contains updates to LowerCaseFilter and StopFilter to recognize and NOT operate on terms marked with KeywordAttribute.isKeyword. (NOTE: also contains changes to changes2html.pl which seem to have been generated automatically).
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054104#comment-13054104 ] James Dyer commented on SOLR-2382: --

{quote} The DIHCache should take the Context as a param and the EntityProcessor does not need to make a copy of the attributes {quote}

I started down this road this afternoon, hoping to have another patch version for you to look at today. But it turns out to be more complicated than I first anticipated. Any ideas how to get around these difficulties?

- DIHCacheProcessor enforces readOnly=false and deletePriorData=true. It also modifies the cacheName if the user specifies partitions. Seeing that Context-Entity-Attributes are immutable, should I pass these as Entity-Scope-Session-Attributes?
- cacheInit() in EntityProcessorBase specifically passes only the parameters that apply to the current situation. This way, if a user applies something non-applicable, it is safely ignored rather than producing undefined behavior. Just forwarding the context on doesn't give this flexibility. Do you think it's ok to just forward on the context anyway?
- DocBuilder instantiates DIHCacheWriter, which in turn gets the user-specified cache implementation and instantiates that. At this point in time, there doesn't seem to be a Context to pass. So, rather than doing this in the constructor, is there a safer place down the road where I should be instantiating the DIHCacheWriter?

I realize that it's more lines of code to always copy these properties into a property map to send to the cache, but I was looking at the cache as being a layer down in the stack, and maybe it shouldn't have the whole context sent to it. What do you think?
DIH Cache Improvements -- Key: SOLR-2382 URL: https://issues.apache.org/jira/browse/SOLR-2382 Project: Solr Issue Type: New Feature Components: contrib - DataImportHandler Reporter: James Dyer Priority: Minor Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
2. Provide a means to temporarily cache a child entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor).
3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an entity input. Also provide the ability to do delta updates on such persistent caches.
4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr shards, or to the same core in parallel.

Use Cases:
1. We needed a flexible, scalable way to temporarily cache child-entity data prior to joining to parent entities.
- Using SqlEntityProcessor with child entities can cause an n+1 select problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a caching mechanism and does not scale.
- There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
3. We wanted the ability to do a delta import of only the entities that changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
- Persistent DIH caches solve this problem.
4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter).
5. In the future, we may need to use shards, creating a need to easily partition our source data into shards.

Implementation Details:
1. De-couple EntityProcessorBase from caching.
- Created a new interface, DIHCache, with two implementations:
- SortedMapBackedCache - an in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
- BerkleyBackedCache - a disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
- NOTE: the existing Lucene contrib db project uses je-3.3.93.jar. I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
2. Allow Entity Processors to take a
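The pluggable cache described above might have roughly the following shape (a sketch for illustration; the method names are assumptions, not the patch's actual DIHCache signatures):

```java
// Hypothetical minimal cache contract for DIH-style child-entity caching.
// SortedMapCache mirrors the in-memory SortedMapBackedCache idea; a
// disk-backed implementation such as BerkleyBackedCache would implement
// the same interface, which is what makes the framework pluggable.
import java.util.*;

interface SimpleDIHCache {
    void add(String key, Map<String, Object> row);
    List<Map<String, Object>> lookup(String key);
    void close();
}

class SortedMapCache implements SimpleDIHCache {
    private final SortedMap<String, List<Map<String, Object>>> data = new TreeMap<>();

    public void add(String key, Map<String, Object> row) {
        // Multiple child rows may share one join key, so each key maps to a list.
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    public List<Map<String, Object>> lookup(String key) {
        return data.getOrDefault(key, Collections.emptyList());
    }

    public void close() {
        data.clear();
    }
}
```

Caching child rows keyed by the join value is what avoids the n+1 select problem: the child query runs once, and each parent row does an in-memory (or on-disk) lookup instead of a new SQL query.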
[jira] [Commented] (LUCENE-3232) Move MutableValues to Common Module
[ https://issues.apache.org/jira/browse/LUCENE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054110#comment-13054110 ] Chris Male commented on LUCENE-3232:

{quote} I wonder if we should name this module something more specific, eg docvalues? values? Should we also move over ValueSource, *DocValues, FieldCacheSource? I think, then, Solr 3.x grouping could cutover and then group by other field types. {quote}

To be honest, that wasn't my plan :D My plan is to first move these to a Common module which will serve basically as a utility module for other modules. The MutableValue classes are useful in a number of places (or will be in the future). I envisage other useful utility-like classes going into this module in the future too. Solr, for example, has a number of very useful utilities that might be of benefit. As such, it doesn't really relate to FunctionQuerys or ValueSources. The next step once this is complete is to do what I originally intended: make a Queries module and push FunctionQuery and all the ValueSources / DocValues into that. In the end you get the following structure:

{code}
modules/
  common/  (MutableValue*)
  queries/ (FunctionQuery, *DocValues, *ValueSource, Queries from contrib/queries)
{code}

Seem reasonable?

Move MutableValues to Common Module --- Key: LUCENE-3232 URL: https://issues.apache.org/jira/browse/LUCENE-3232 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3232.patch, LUCENE-3232.patch

Solr makes use of the MutableValue* series of classes to improve performance of grouping by FunctionQuery (I think). As such they are used in ValueSource implementations. Consequently we need to move these classes in order to move the ValueSources. As Yonik pointed out, these classes have use beyond just FunctionQuerys and might be used by both Solr and other modules. However I don't think they belong in Lucene core, since they aren't really related to search functionality. Therefore I think we should put them into a Common module, which can serve as a dependency to Solr and any module.
[jira] [Issue Comment Edited] (LUCENE-3232) Move MutableValues to Common Module
[ https://issues.apache.org/jira/browse/LUCENE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054110#comment-13054110 ] Chris Male edited comment on LUCENE-3232 at 6/23/11 9:28 PM: - {quote} I wonder if we should name this module something more specific, eg docvalues? values? Should we also move over ValueSource, *DocValues, FieldCacheSource? I think, then, Solr 3.x grouping could cutover and then group by other field types. {quote} To be honest, that wasn't my plan :D My plan is to first move these to a Common module which will serve basically as a utility module for other modules. The MutableValue classes are useful in a number of places (or will be in the future). I envisage other useful utility-like classes going into this module in the future too. Solr for example has a number of very useful utilities that might be of benefit. As such, it doesn't really relate to FunctionQuerys or ValueSources. The next step once this is complete is to do what I originally intended and make a Queries module and push FunctionQuery and all the ValueSources / DocValues into that. In the end you get the following structure: {code} modules/ common/ (MutableValue*) queries/ (FunctionQuery, *DocValues, *ValueSource, Queries from contrib/queries) {code} Seem reasonable?
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054114#comment-13054114 ] Robert Muir commented on LUCENE-3234: - You can change it if you don't mind. However, I think I agree it would be good to figure out if there is an n^2 here. This might have some effect on what the default value should be... ideally there is some way we could fix the n^2. Is there a way to turn your test case into a benchmark, or do you have a separate benchmark (the example you mentioned where it blows up really badly)? This could help in looking at what's going on.
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054125#comment-13054125 ] Mike Sokolov commented on LUCENE-3234: -- I don't think I can share the test documents I have - they belong to someone else. I can look at trying to make something bad happen with the wikipedia data, but I'm curious why a benchmark is preferable to a test case?
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054133#comment-13054133 ] Robert Muir commented on LUCENE-3234: - Oh that's ok, I just meant a little tiny benchmark hitting the nasty case that we think might be n^2. If the little test case does that... then that will work; I just wasn't sure if it did. Either way, just something to look at in the profiler, etc.
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054162#comment-13054162 ] Mike Sokolov commented on LUCENE-3234: -- I did go back and look at the original case that made me worried; in that case the bad document is 650K, and the matched term occurs 23000 times in it. The search still finishes in 24 sec or so on my desktop, which isn't too bad I guess, considering. After looking at that and measuring the change in the test case in the patch as the number of terms increases, I don't think there actually is an n^2 - just linear, but the growth is still enough that the patch has value. The test case in the patch is closely targeted at the method which takes all the time when you have large numbers of matching terms in a single document.
Re: [VOTE] release 3.3
+1 Thanks for pulling the release together Robert. On Fri, Jun 24, 2011 at 9:33 AM, Michael McCandless luc...@mikemccandless.com wrote: +1 Smoke testing passed for me, except for the Java 1.5-only hang in TestDoubleBarrelLRUCache (LUCENE-3235), but I don't think that should block the release. Mike McCandless http://blog.mikemccandless.com On Thu, Jun 23, 2011 at 4:18 PM, Robert Muir rcm...@gmail.com wrote: Artifacts here: http://s.apache.org/lusolr33rc0 Working release notes here: http://wiki.apache.org/lucene-java/ReleaseNote33 http://wiki.apache.org/solr/ReleaseNote33 I ran the automated release test script in trunk/dev-tools/scripts/smokeTestRelease.py, and ran 'ant test' at the top level 50 times on windows. Here is my +1 -- Chris Male | Software Developer | JTeam BV.| www.jteam.nl
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054193#comment-13054193 ] Koji Sekiguchi commented on LUCENE-3234: Mike, thank you for your continued interest in FVH! Can you add the parameter for Solr, with an appropriate default value, if you would like? I don't know whether the assertTrue test in testManyRepeatedTerms() is ok for Jenkins?
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated LUCENE-3234: --- Affects Version/s: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Fix Version/s: 3.4, 4.0
[jira] [Assigned] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reassigned LUCENE-3234: -- Assignee: Koji Sekiguchi
[jira] [Updated] (SOLR-2429) ability to not cache a filter
[ https://issues.apache.org/jira/browse/SOLR-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-2429:
-------------------------------
    Attachment: SOLR-2429.patch

Here's a patch that allows one to add cache=false to top-level queries (main queries, filter queries, facet queries, etc).

Currently (without this patch) Solr generates the set of documents that match each filter individually, so that they can be cached and reused. Adding cache=false to the main query prevents lookup/storing in the query cache. Adding cache=false to any filter query causes the filterCache not to be used. Further, such a filter query is actually run in parallel with the main query and any other non-cached filter queries (which can speed things up if the base query or the other filter queries are relatively sparse).

There is also an optional cost parameter that controls the order in which non-cached filter queries are evaluated, so knowledgeable users can order less expensive non-cached filters before expensive ones.

As an additional feature for very high-cost filters: if cache=false and cost=100 and the query implements the PostFilter interface, a Collector will be requested from that query and used to filter documents after they have matched the main query and all other filter queries. There can be multiple post filters, and they are also ordered by cost.

The frange query (a range over function queries; background here: http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/ ) also now implements PostFilter.

Examples:
{code}
// normal function range query used as a filter; all matching documents generated up front and cached
fq={!frange l=10 u=100}mul(popularity,price)

// function range query run in parallel with the main query, like a traditional Lucene filter
fq={!frange l=10 u=100 cache=false}mul(popularity,price)

// function range query checked after each document that already matches the query and all other filters.
// Good for really expensive function queries.
fq={!frange l=10 u=100 cache=false cost=100}mul(popularity,price)
{code}

ability to not cache a filter
-----------------------------
                Key: SOLR-2429
                URL: https://issues.apache.org/jira/browse/SOLR-2429
            Project: Solr
         Issue Type: New Feature
           Reporter: Yonik Seeley
        Attachments: SOLR-2429.patch

A user should be able to add {!cache=false} to a query or filter query.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
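The local-params strings in the examples above follow a simple pattern, so they can be built programmatically. A minimal Python sketch (the helper name frange_filter is illustrative, not part of any Solr client API):

```python
def frange_filter(func, l, u, cache=True, cost=None):
    """Build a Solr fq value for an frange query with optional
    cache=false and cost local params, per the examples above."""
    parts = [f"l={l}", f"u={u}"]
    if not cache:
        parts.append("cache=false")  # skip the filterCache; run alongside the main query
    if cost is not None:
        parts.append(f"cost={cost}")  # order non-cached filters cheapest-first
    return "{!frange " + " ".join(parts) + "}" + func
```

For example, frange_filter("mul(popularity,price)", 10, 100, cache=False, cost=100) reproduces the post-filter example from the patch description.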
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------
    Attachment: LUCENE-3234.patch

Added solr parameter hl.phraseLimit (default=5000)

Provide limit on phrase analysis in FastVectorHighlighter
---------------------------------------------------------
                Key: LUCENE-3234
                URL: https://issues.apache.org/jira/browse/LUCENE-3234
            Project: Lucene - Java
         Issue Type: Improvement
   Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
           Reporter: Mike Sokolov
           Assignee: Koji Sekiguchi
            Fix For: 3.4, 4.0
        Attachments: LUCENE-3234.patch, LUCENE-3234.patch

With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze.

The patch includes an artificial test case that shows a 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.

With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.
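The phraseLimit cutoff described above amounts to truncating the set of candidate phrases that the highlighter examines, with -1 meaning unlimited. A minimal Python sketch of that idea (collect_phrases is a hypothetical name for illustration, not FVH's actual API):

```python
def collect_phrases(candidate_phrases, phrase_limit=5000):
    """Sketch of the phraseLimit idea: stop enumerating candidate
    phrases once the limit is reached; phrase_limit = -1 keeps the
    existing examine-everything behavior."""
    if phrase_limit < 0:
        # existing FVH behavior: examine every possible phrase
        return list(candidate_phrases)
    out = []
    for phrase in candidate_phrases:
        if len(out) >= phrase_limit:
            break  # accept possibly less-than-perfect scoring for speed
        out.append(phrase)
    return out
```

With phrase_limit=1 this keeps only the first candidate, which mirrors the "just show the first snippet" use case from the description.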
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------
    Attachment: LUCENE-3234.patch
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------
    Attachment: (was: LUCENE-3234.patch)
[jira] [Issue Comment Edited] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054213#comment-13054213 ]

Mike Sokolov edited comment on LUCENE-3234 at 6/24/11 2:06 AM:
---------------------------------------------------------------
Added solr parameter hl.phraseLimit (default=5000)

Koji - I'm not sure what the issue with assertTrue is? It looked to me as if the test case ultimately inherits from org.junit.Assert, which defines the method. Is there a different version of junit on Jenkins without that method?

was (Author: sokolov):
Added solr parameter hl.phraseLimit (default=5000)
[jira] [Commented] (LUCENE-3230) Make FSDirectory.fsync() public and static
[ https://issues.apache.org/jira/browse/LUCENE-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054234#comment-13054234 ]

Shai Erera commented on LUCENE-3230:
------------------------------------
I opened 2 RandomAccessFile instances over the same file and called getFD() on each. I received 2 FileDescriptor instances, each with a different 'handle' value. So I agree it's not clear whether syncing one FD affects the other.

I also found some posts on the web claiming that fsync() doesn't really work on all OSes, and that some hardware manufacturers enable hardware write caching, so even if the OS obeys the sync() call, the hardware may not. I guess there's not much we can do about the hardware case.

* http://hardware.slashdot.org/story/05/05/13/0529252/Your-Hard-Drive-Lies-to-You
* http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html (bad fsync on Mac OS X)

So perhaps we shouldn't make it a public API after all. If someone can sync() on the same OutputStream he used to flush/close, it's better than how we do it today. I hate to introduce public API with big warnings.

As for our case (Lucene's usage of sync()), it would be good if we could sync() in the IndexOutput that wrote the data. So maybe we should add sync() to IndexOutput? Not sure how that will play out in Directory, which today syncs file names, not IndexOutput instances.

Shall I close this issue then? Or rename it to handle the IndexOutput.sync() issue?

Make FSDirectory.fsync() public and static
------------------------------------------
                Key: LUCENE-3230
                URL: https://issues.apache.org/jira/browse/LUCENE-3230
            Project: Lucene - Java
         Issue Type: New Feature
         Components: core/store
           Reporter: Shai Erera
           Assignee: Shai Erera
           Priority: Minor
            Fix For: 3.3, 4.0

I find FSDirectory.fsync() (today protected and an instance method) very useful as a utility to sync() files. I'd like to create an FSDirectory.sync() utility which contains the exact same impl as FSDir.fsync(), and have the latter call it. We could make it part of IOUtils too, as it's a completely standalone utility.

I would get rid of FSDir.fsync() if it weren't protected (as if encouraging people to override it). I doubt anyone really overrides it (our core Directories don't).

Also, while reviewing the code, I noticed that if an IOException occurs, the code sleeps for 5 msec. If an InterruptedException occurs during that sleep, it immediately throws ThreadInterruptedException, completely ignoring the fact that it slept due to the IOException. Shouldn't we at least pass the IOException's getMessage() on to the ThreadInterruptedException?

The patch is trivial, so I'd like to get some feedback before I post it.