[jira] Commented: (LUCENENET-311) TestNRTReaderWithThreads.TestIndexing

2010-09-11 Thread Kevin Dotzenrod (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENENET-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908358#action_12908358
 ] 

Kevin Dotzenrod commented on LUCENENET-311:
---

I'm running Lucene.Net 2.9.2 and am still getting this exception.  Do I need to 
apply this patch to the 2.9.2 source?  After process termination the index is left 
in a corrupt state.

exception details:
Event Type: Error
Event Source: ASP.NET 2.0.50727.0
Event Category: None
Event ID: 1334
Date: 9/9/2010
Time: 12:42:11 AM
User: N/A
Computer: SECDMDWAPPP02
Description:
An unhandled exception occurred and the process was terminated.

Application ID: DefaultDomain

Process ID: 4232

Exception: System.Runtime.Serialization.SerializationException

Message: Unable to find assembly 'Lucene.Net, Version=2.9.1.2, Culture=neutral, 
PublicKeyToken=null'.

StackTrace: at 
System.Runtime.Serialization.Formatters.Binary.BinaryAssemblyInfo.GetAssembly()
   at 
System.Runtime.Serialization.Formatters.Binary.ObjectReader.GetType(BinaryAssemblyInfo
 assemblyInfo, String name)
   at System.Runtime.Serialization.Formatters.Binary.ObjectMap..ctor(String 
objectName, String[] memberNames, BinaryTypeEnum[] binaryTypeEnumA, Object[] 
typeInformationA, Int32[] memberAssemIds, ObjectReader objectReader, Int32 
objectId, BinaryAssemblyInfo assemblyInfo, SizedArray assemIdToAssemblyTable)
   at System.Runtime.Serialization.Formatters.Binary.ObjectMap.Create(String 
name, String[] memberNames, BinaryTypeEnum[] binaryTypeEnumA, Object[] 
typeInformationA, Int32[] memberAssemIds, ObjectReader objectReader, Int32 
objectId, BinaryAssemblyInfo assemblyInfo, SizedArray assemIdToAssemblyTable)
   at 
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryObjectWithMapTyped
 record)
   at 
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryHeaderEnum
 binaryHeaderEnum)
   at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run()
   at 
System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler
 handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, 
IMethodCallMessage methodCallMessage)
   at 
System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream
 serializationStream, HeaderHandler handler, Boolean fCheck, Boolean 
isCrossAppDomain, IMethodCallMessage methodCallMessage)
   at 
System.Runtime.Remoting.Channels.CrossAppDomainSerializer.DeserializeObject(MemoryStream
 stm)
   at System.AppDomain.Deserialize(Byte[] blob)
   at System.AppDomain.UnmarshalObject(Byte[] blob)


 TestNRTReaderWithThreads.TestIndexing
 -

 Key: LUCENENET-311
 URL: https://issues.apache.org/jira/browse/LUCENENET-311
 Project: Lucene.Net
  Issue Type: Bug
Reporter: Digy
 Attachments: LUCENENET-311.patch, LUCENENET-311.patch, 
 LUCENENET-311.patch


 The problem is in TearDown: it fails because of harmless exceptions.
 (Lucene.Net.Index.TestNRTReaderWithThreads.TestIndexing:
 An unhandled System.Runtime.Serialization.SerializationException was thrown 
 while executing this test : Unable to find assembly 'Lucene.Net, 
 Version=2.9.1.1, Culture=neutral, PublicKeyToken=null'.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (LUCENENET-311) TestNRTReaderWithThreads.TestIndexing

2010-09-11 Thread Kevin Dotzenrod (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENENET-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908358#action_12908358
 ] 

Kevin Dotzenrod edited comment on LUCENENET-311 at 9/11/10 3:46 PM:


I'm running Lucene.Net 2.9.2 and am still getting this exception.  Do I need to 
apply this patch to the 2.9.2 source?  After process termination the index is left 
in a corrupt state.

exception details:
Event Type: Error
Event Source: ASP.NET 2.0.50727.0
Event Category: None
Event ID: 1334
Date: 9/9/2010
Time: 12:42:11 AM
User: N/A
Computer: SECDMDWAPPP02
Description:
An unhandled exception occurred and the process was terminated.

Application ID: DefaultDomain

Process ID: 4232

Exception: System.Runtime.Serialization.SerializationException

Message: Unable to find assembly 'Lucene.Net, Version=2.9.2.2, Culture=neutral, 
PublicKeyToken=null'.

StackTrace: at 
System.Runtime.Serialization.Formatters.Binary.BinaryAssemblyInfo.GetAssembly()
   at 
System.Runtime.Serialization.Formatters.Binary.ObjectReader.GetType(BinaryAssemblyInfo
 assemblyInfo, String name)
   at System.Runtime.Serialization.Formatters.Binary.ObjectMap..ctor(String 
objectName, String[] memberNames, BinaryTypeEnum[] binaryTypeEnumA, Object[] 
typeInformationA, Int32[] memberAssemIds, ObjectReader objectReader, Int32 
objectId, BinaryAssemblyInfo assemblyInfo, SizedArray assemIdToAssemblyTable)
   at System.Runtime.Serialization.Formatters.Binary.ObjectMap.Create(String 
name, String[] memberNames, BinaryTypeEnum[] binaryTypeEnumA, Object[] 
typeInformationA, Int32[] memberAssemIds, ObjectReader objectReader, Int32 
objectId, BinaryAssemblyInfo assemblyInfo, SizedArray assemIdToAssemblyTable)
   at 
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryObjectWithMapTyped
 record)
   at 
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryHeaderEnum
 binaryHeaderEnum)
   at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run()
   at 
System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler
 handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, 
IMethodCallMessage methodCallMessage)
   at 
System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream
 serializationStream, HeaderHandler handler, Boolean fCheck, Boolean 
isCrossAppDomain, IMethodCallMessage methodCallMessage)
   at 
System.Runtime.Remoting.Channels.CrossAppDomainSerializer.DeserializeObject(MemoryStream
 stm)
   at System.AppDomain.Deserialize(Byte[] blob)
   at System.AppDomain.UnmarshalObject(Byte[] blob)


  was (Author: kdotzenrod):
I'm running Lucene.Net 2.9.2 and am still getting this exception.  Do I 
need to apply this patch to the 2.9.2 source?  After process termination the index 
is left in a corrupt state.

exception details:
Event Type: Error
Event Source: ASP.NET 2.0.50727.0
Event Category: None
Event ID: 1334
Date: 9/9/2010
Time: 12:42:11 AM
User: N/A
Computer: SECDMDWAPPP02
Description:
An unhandled exception occurred and the process was terminated.

Application ID: DefaultDomain

Process ID: 4232

Exception: System.Runtime.Serialization.SerializationException

Message: Unable to find assembly 'Lucene.Net, Version=2.9.1.2, Culture=neutral, 
PublicKeyToken=null'.

StackTrace: at 
System.Runtime.Serialization.Formatters.Binary.BinaryAssemblyInfo.GetAssembly()
   at 
System.Runtime.Serialization.Formatters.Binary.ObjectReader.GetType(BinaryAssemblyInfo
 assemblyInfo, String name)
   at System.Runtime.Serialization.Formatters.Binary.ObjectMap..ctor(String 
objectName, String[] memberNames, BinaryTypeEnum[] binaryTypeEnumA, Object[] 
typeInformationA, Int32[] memberAssemIds, ObjectReader objectReader, Int32 
objectId, BinaryAssemblyInfo assemblyInfo, SizedArray assemIdToAssemblyTable)
   at System.Runtime.Serialization.Formatters.Binary.ObjectMap.Create(String 
name, String[] memberNames, BinaryTypeEnum[] binaryTypeEnumA, Object[] 
typeInformationA, Int32[] memberAssemIds, ObjectReader objectReader, Int32 
objectId, BinaryAssemblyInfo assemblyInfo, SizedArray assemIdToAssemblyTable)
   at 
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryObjectWithMapTyped
 record)
   at 
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryHeaderEnum
 binaryHeaderEnum)
   at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run()
   at 
System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler
 handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, 

Board Report for Sept.

2010-09-11 Thread Grant Ingersoll
I intend to file the following Board Report for September tomorrow (Sept. 12) 
unless I hear corrections, etc.  In particular, Lucene.NET needs to step up and 
report, as I have not heard from them.

=== Lucene Status Report: Sept, 2010 ===

TLP

The Lucy project has been moved to Incubator where it intends to become a 
TLP.


LUCENE JAVA/Solr

Lucene Java is a search-engine toolkit and Solr is a search server
built on top of Lucene. The community is very active.
The community is working towards a 2.9.3, 3.0.2 and 4.0 release.

LUCENE.NET

Lucene.NET is a .NET-based port of Lucene Java. Development appears
to have stagnated and the PMC is beginning to look into issues here.

Open Relevance Project

The Open Relevance Project is a project aimed at providing Lucene
and others with tools for judging the quality of search and machine
learning approaches.  The community is not very active, but
we don't expect it to be very high volume either.  The community
has started some discussion around what goals the project should
have.

PyLucene

PyLucene is a Python integration of Lucene Java.  Development is
active. PyLucene 3.0.2-1 and 2.9.3-1 were released on July 3rd, 2010.
As a development milestone, experimental Python 3.1.2 ports of PyLucene
and JCC were completed July 12th, 2010.

Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

2010-09-11 Thread mark harwood
What is the best practices formula for determining above average
correlations of adjacent terms

I gave this some thought in https://issues.apache.org/jira/browse/LUCENE-474
I found the Jaccard coefficient favoured rare words too strongly and so went 
for a blend as shown below:


public float getScore()
{
    // Jaccard-style overlap: co-occurrence count relative to the
    // combined document frequencies of both terms.
    float overallIntersectionPercent = coIncidenceDocCount
        / (float) (termADocFreq + termBDocFreq);
    // Confidence of the pair given term B alone.
    float termBIntersectionPercent = coIncidenceDocCount
        / (float) termBDocFreq;

    // Using just the termB intersection favours common words as
    // coincidents, e.g. "new food":
    //   return termBIntersectionPercent;
    // Using just the overall intersection favours rare words as
    // coincidents, e.g. "Szechuan food":
    //   return overallIntersectionPercent;
    // So here we take an average of the two:
    return (termBIntersectionPercent + overallIntersectionPercent) / 2;
}





From: Mark Bennett mbenn...@ideaeng.com
To: dev@lucene.apache.org
Sent: Fri, 10 September, 2010 18:44:31
Subject: Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

Thanks Mark H,

Maybe I'll look at MLT (More Like This) again.  I'll also check out Zipf.

It's claimed that Question and Answer wording is different enough from generic 
text content that different techniques might be indicated. From what I remember:
1: Though nouns normally convey 60% of relevancy in general text, QA content is 
skewed a bit more towards verbs.
2: Questions may contain more noise words (though perhaps in useful groupings)
3: Vocabulary mismatch of interrogative vs. declarative / narrative (Q vs A)
4: Vocabulary mismatch of novices vs experts (Q vs A)

It was item 2 that I was hoping to capitalize on with NGrams / Shingles.

Still waiting for the relevancy math nerds to chime in about the log-log and 
IDF 
stuff ... ;-)

I was thinking a bit more about the math involved here.

What is the best-practices formula for determining above-average correlations 
of adjacent terms, beyond what random chance would give? So you notice that 
"white" and "house" appear next to each other more than what chance 
distribution would explain, so you decide it's an important NGram.

The noise floor isn't too bad for the typical shopping-cart items calculation.
You analyze the items present or not present in 1,000 shopping cart receipts.
If grocery items were completely independent, then the random level is just
the odds of the two items multiplied together:
1,000 shopping carts
200 have cereal
250 have milk
chance of
cereal = 200/1,000 = 20%
milk = 250/1,000 = 25%
IF independent then
P(cereal AND milk) = P(cereal) * P(milk)
20% * 25% = 5%
So 50 carts are likely to have both cereal and milk.
And if MORE than 50 carts have cereal and milk, then it's worth noting.
The classic example is diapers and beer, which is a bit apocryphal and NOT
expected, but I like the breakfast cereal and milk example better because it IS
expected.
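
To make that baseline concrete, here's a minimal sketch of the arithmetic in
Java (the observed count of 80 is a made-up number for illustration, not real
data):

public class CoOccurrenceBaseline {
    public static void main(String[] args) {
        int totalCarts = 1000;
        int cerealCarts = 200;   // 20%
        int milkCarts = 250;     // 25%

        // Expected co-occurrences if the items were independent:
        // P(cereal) * P(milk) * total = 0.20 * 0.25 * 1000 = 50 carts.
        double expectedBoth =
            (cerealCarts / (double) totalCarts)
            * (milkCarts / (double) totalCarts)
            * totalCarts;

        // "Lift" = observed / expected; values well above 1.0 suggest
        // the pairing is more than random chance.
        int observedBoth = 80;   // hypothetical observed count
        double lift = observedBoth / expectedBoth;
        System.out.printf("expected=%.1f observed=%d lift=%.2f%n",
                expectedBoth, observedBoth, lift);
    }
}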

Now back to word-A appearing directly before word-B, and finding the baseline 
number of times you'd expect just from random chance.

Although Lucene/Luke gives you total word instances and document counts, what 
you'd really want is the number of possible N-Grams, which is affected by 
document boundaries, so it gets a little weird.

Some other differences between the word-A word-B calculation vs milk and cereal:
1: I want ordered pairs, "white" before "house"
2: A document is NOT like a shopping cart in that I DO care how many times 
"white" appears before "house", whereas in the shopping carts I only cared about 
present or not present, so document count is less helpful here.

I'm sure some companies and PhDs have super secret formulas for this, but I'd 
be content to just compare it to baseline random chance.

Mark B

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513



On Fri, Sep 10, 2010 at 3:17 AM, mark harwood markharw...@yahoo.co.uk wrote:

Hi Mark
I've played with Shingles recently in some auto-categorisation work where my 
starting assumption was that multi-word terms will hold more information value 
than individual words, and that phrase queries on separate terms will not give 
these term combos their true reward (in terms of IDF) - or, if they did compute 
the true IDF, would require lots of disk IO to do so. Shingles present a 
conveniently pre-aggregated score for these combos.
Looking at the results of MoreLikeThis queries based on a shingling analyzer, 
the results I saw generally seemed good, but I did not formally benchmark this 
against non-shingled indexes. Not everything was rosy in that I did see some 
tendency to over-reward certain rare shingles (e.g. a shared mention of "New 
Years Eve Party" pulled otherwise mostly unrelated news

[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908336#action_12908336
 ] 

Yonik Seeley commented on LUCENE-2504:
--

bq. This is all quite silly: we are only doing this to game hotspot into 
properly inlining/compiling what is in fact an array lookup, just currently 
hidden behind method calls in the packed ints impls. We really shouldn't have 
to do this custom source code specialization.

Yeah, but this is the way hotspot currently works, and I don't know if there 
are any plans to change it.
Hotspot can be pretty aggressive at inlining, but then it deoptimizes when it 
turns out that the inline is no longer valid (because of a different 
implementation).

It's something worth keeping in mind for the rest of Lucene too - both in 
benchmarking and design. Multiple implementations used from a single spot will 
not be inlined (if multiple implementations are actually used in the same run).
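
As a minimal illustration of that effect (all names below are invented for the
example, not Lucene code): while only one implementation ever reaches a hot
call site, hotspot can inline it down to the array lookup; mix a second
implementation through the same loop in one run and the site stays a virtual
call.

interface IntLookup {
    int get(int index);
}

class ArrayLookup implements IntLookup {
    private final int[] values;
    ArrayLookup(int[] values) { this.values = values; }
    public int get(int index) { return values[index]; }
}

class ShiftedLookup implements IntLookup {
    private final int[] values;
    ShiftedLookup(int[] values) { this.values = values; }
    public int get(int index) { return values[index] >>> 1; }
}

class CallSiteDemo {
    // While only ArrayLookup ever reaches this loop, hotspot can inline
    // get() down to a plain array load. A benchmark that feeds both
    // ArrayLookup and ShiftedLookup through this same method in one run
    // will instead measure a virtual call per element.
    static long sum(IntLookup lookup, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += lookup.get(i);
        }
        return total;
    }
}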


 sorting performance regression
 --

 Key: LUCENE-2504
 URL: https://issues.apache.org/jira/browse/LUCENE-2504
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Yonik Seeley
 Fix For: 4.0

 Attachments: LUCENE-2504.patch, LUCENE-2504.zip


 sorting can be much slower on trunk than branch_3x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: lia2 tests for branch_3x ?

2010-09-11 Thread Marvin Humphrey
On Fri, Sep 10, 2010 at 06:21:03PM -0400, Michael McCandless wrote:
 On Fri, Sep 10, 2010 at 6:08 PM, Uwe Schindler u...@thetaphi.de wrote:
 
  Wouldn't be NOTICE.txt the right place for this?
 
 I think NOTICE.txt/LICENSE.txt is in order to reference the license of
 3rd party sources when they are incorporated?

How is this material coming in to Apache?  Is it being submitted directly to
the ASF by the copyright owner or owner's agent, in which case the following
applies?

http://www.apache.org/legal/src-headers.html#headers

I had thought that was the case, but if not, then this applies instead and I
believe usage is more constrained...

http://www.apache.org/legal/src-headers.html#3party

... though I'm not clear about exactly what the constraints are because the
license is ASL2.  If it were another license, then usage would definitely be
more constrained.

Regardless, NOTICE.txt isn't the place for a link advertising a book.

http://markmail.org/message/cxwtnuys65c7hs2y (Roy Fielding)

Hey, I'm all for people having opinions on development and credits and
documentation. NOTICE and LICENSE are none of those. They are not open to
anyone's opinions other than the copyright owners that require such notices,
and they must not be added where they are not required. Each additional 
notice
places a burden on the ASF and all downstream redistributors.

...

If you put stuff in NOTICE that is not legally required to be there, I will
remove it as an officer of the ASF. 

 LIA2's source code is already ASL2, though it is Copyright Manning so
 probably we will need to also put an entry in NOTICE.txt/LICENSE.txt.

It would be nice if that were not the case, because of the burden on
downstream.  Why don't IBM, Lucid, Twitter, and so on insist on having their
copyrights put into NOTICE.txt?  Managing credit on a collective project like
this is really hard.  IMO, to be fairest to everyone it's best to avoid the
issue altogether whenever possible.

Again, this in no way diminishes the value of Manning's potential contribution
or our gratitude for it.  I just hope Manning understands why accommodating
their request perhaps isn't as easy as it might have seemed from the outside.

Marvin Humphrey


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (SOLR-2001) NPE using http://localhost:8983/solr/select/?q=

2010-09-11 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-2001.


Fix Version/s: 4.0
   Resolution: Fixed

committed.

 NPE using http://localhost:8983/solr/select/?q=
 ---

 Key: SOLR-2001
 URL: https://issues.apache.org/jira/browse/SOLR-2001
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.4.1
 Environment: http://localhost:8983/solr/select/?q=
Reporter: Sebb
 Fix For: 4.0

 Attachments: SOLR-2001.patch


 {code}
 null
 java.lang.NullPointerException
   at java.io.StringReader.<init>(StringReader.java:33)
   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197)
   at 
 org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
   at org.apache.solr.search.QParser.getQuery(QParser.java:131)
   at 
 org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89)
   at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
   at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
   at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at 
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at 
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at 
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at 
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at 
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at 
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at 
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at 
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at 
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
 RequestURI=/solr/select/
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2504) sorting performance regression

2010-09-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908357#action_12908357
 ] 

Yonik Seeley commented on LUCENE-2504:
--

I'm still seeing bad degradations in Solr - I think it's because the default 
way for Solr to sort strings is with MissingLastOrdComparator, which isn't 
specialized. I'll try and work up a patch based on Mike's work.

 sorting performance regression
 --

 Key: LUCENE-2504
 URL: https://issues.apache.org/jira/browse/LUCENE-2504
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Yonik Seeley
 Fix For: 4.0

 Attachments: LUCENE-2504.patch, LUCENE-2504.zip


 sorting can be much slower on trunk than branch_3x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: IndexReader Cache - a different angle

2010-09-11 Thread Simon Willnauer
Hi Shai,

On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera ser...@gmail.com wrote:
 Hi

 Lucene's Caches have been heavily discussed before (e.g., LUCENE-831,
 LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
 many proposals to attack this problem, w/ no developed solution.

I didn't go through those issues so forgive me if something I bring up
has already been discussed.
I have a couple of questions about your proposal - please find them inline...


 I'd like to explore a different, IMO much simpler, angle to attack this
 problem. Instead of having Lucene manage the Cache itself, we let the
 application manage it, however Lucene will provide the necessary hooks
 in IndexReader to allow it. The hooks I have in mind are:

 (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc. --
 already exists.

 (2) When reopen() is called, Lucene will take care to call a
 Cache.load(IndexReader), so that the application can pull whatever
 information
 it needs from the passed-in IndexReader.
Would that do anything else than passing the new reader (if reopened)
to the cache's load method? I wonder if this is more than

  if (newReader != oldReader)
      cache.load(newReader);

If so, something like that should be done on a segment reader anyway,
right? From my perspective this isn't more than a callback or visitor
that should walk through the subreaders and be called for each reopened
sub-reader. A cache-warming visitor / callback would then be trivial
and the API would be more general.
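
A rough sketch of that kind of callback against the 3.x IndexReader API
(ReopenListener and the identity comparison are illustrative assumptions, and
this assumes top-level readers whose getSequentialSubReaders() returns the
per-segment readers):

import org.apache.lucene.index.IndexReader;

// Hypothetical listener invoked once per sub-reader that actually changed.
interface ReopenListener {
    void reopened(IndexReader subReader);   // e.g. cache.load(subReader)
}

class ReopenVisitor {
    // Walk the sub-readers of a reopened top-level reader and notify the
    // listener only for segments not shared with the old reader
    // (reopen() reuses unchanged SegmentReader instances).
    static void visitReopened(IndexReader oldReader, IndexReader newReader,
                              ReopenListener listener) {
        IndexReader[] oldSubs = oldReader.getSequentialSubReaders();
        for (IndexReader sub : newReader.getSequentialSubReaders()) {
            boolean shared = false;
            for (IndexReader old : oldSubs) {
                if (sub == old) { shared = true; break; }
            }
            if (!shared) {
                listener.reopened(sub);
            }
        }
    }
}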


 So to be more concrete on my proposal, I'd like to support caching in
 the following way (and while I've spent some time thinking about it, I'm
 sure there are great suggestions to improve it):

 * Application provides a CacheFactory to IndexReader.open/reopen, which
 exposes some very simple API, such as createCache, or
 initCache(IndexReader) etc. Something which returns a Cache object,
 which does not have very strict/concrete API.

My first question would be: why should the reader know about the Cache if
there is no strict / concrete API?
I can follow you with the CacheFactory to create cache objects, but why
would the reader have to know / receive this object? Maybe this is
answered further down the path, but I don't see why the notion
of a cache must exist within open/reopen, or whether that could be
implemented in a more general, cache-agnostic way.

 * IndexReader, most probably at the SegmentReader level uses
 CacheFactory to create a new Cache instance and calls its
 load(IndexReader) method, so that the Cache would initialize itself.
That is what I was thinking above - yet is that more than a callback
for each reopened or opened segment reader?


 * The application can use CacheFactory to obtain the Cache object per
 IndexReader (for example, during Collector.setNextReader), or we can
 have IndexReader offer a getCache() method.
:)  Until here the cache is only used by the application itself, not by
any Lucene API, right? I can think of much application-specific data
that could usefully be associated with an IR beyond the caching
use case - again, this could be a more general API solving that
problem.

 * One of the Cache APIs would be getCache(TYPE), where TYPE is a String or
 Object, or an interface CacheType w/ no methods, just to be a marker
 one, and the application is free to impl it however it wants. That's a
 loose API, I know, but completely in the application's hands, which makes
 Lucene code simpler.
I like the idea together with the metadata-associating functionality
from above, something like public <T> T IndexReader#get(Type<T> type).
Hmm, that looks quite similar to Attributes, doesn't it?! :) However, this
could be done in many ways, but again cache-agnostic
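
A minimal sketch of such a typed-key lookup (hypothetical names throughout;
nothing like this exists in Lucene today):

import java.util.HashMap;
import java.util.Map;

final class TypedRegistry {
    // Marker key that carries the value's class for a safe cast.
    static final class Type<T> {
        final Class<T> clazz;
        Type(Class<T> clazz) { this.clazz = clazz; }
    }

    private final Map<Type<?>, Object> entries =
        new HashMap<Type<?>, Object>();

    <T> void put(Type<T> type, T value) {
        entries.put(type, value);
    }

    <T> T get(Type<T> type) {
        return type.clazz.cast(entries.get(type));
    }
}

An IndexReader#get(Type<T>) in that spirit would keep the reader
cache-agnostic: it just hands back opaque per-reader data.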

 * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
 provide the user w/ IndexReader-similar API, only more efficient than
 say TermDocs -- something w/ random access to the docs inside, perhaps
 even an OpenBitSet. Lucene can take advantage of it if, say, we create a
 CachingSegmentReader which makes use of the cache, and checks every time
 termDocs() is called if the required Term is cached or not etc. I admit
 I may be thinking too much ahead.
I see what you are trying to do here. I also see how this could be
useful, but I guess coming up with a stable API which serves lots of
applications would be quite hard. A CachingSegmentReader could be a
very simple decorator which would not require touching the IR
interface. Something like that could be part of Lucene, but I'm not
sure it's necessarily part of Lucene core.

 That's more or less what I've been thinking. I'm sure there are many
 details to iron out, but I hope I've managed to pass the general
 proposal through to you.

Absolutely, this is how it works, isn't it!


 What I'm after first is to allow applications to deal w/ postings caching more
 natively. For example, if you have a posting w/ payloads you'd like to
 read into memory, or if you would like a term's TermDocs to be cached

[jira] Created: (LUCENE-2640) add LuceneTestCase[J4].newField

2010-09-11 Thread Robert Muir (JIRA)
add LuceneTestCase[J4].newField
---

 Key: LUCENE-2640
 URL: https://issues.apache.org/jira/browse/LUCENE-2640
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0
 Attachments: LUCENE-2640.patch

I think it would be good to vary the different field options in tests.

For example, we do this with IW settings (newIndexWriterConfig), and 
directories (newDirectory).

This patch adds newField(); it works just like new Field(), except it will 
sometimes turn on extra options:
Stored fields, term vectors, additional term vectors data, etc.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2640) add LuceneTestCase[J4].newField

2010-09-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2640:


Attachment: LUCENE-2640.patch

Attached is a patch, with all core tests converted (and passing).

We can always do the contrib tests later.


 add LuceneTestCase[J4].newField
 ---

 Key: LUCENE-2640
 URL: https://issues.apache.org/jira/browse/LUCENE-2640
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2640.patch


 I think it would be good to vary the different field options in tests.
 For example, we do this with IW settings (newIndexWriterConfig), and 
 directories (newDirectory).
 This patch adds newField(); it works just like new Field(), except it will 
 sometimes turn on extra options:
 Stored fields, term vectors, additional term vectors data, etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2001) NPE using http://localhost:8983/solr/select/?q=

2010-09-11 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908370#action_12908370
 ] 

Lance Norskog commented on SOLR-2001:
-

+many

 NPE using http://localhost:8983/solr/select/?q=
 ---

 Key: SOLR-2001
 URL: https://issues.apache.org/jira/browse/SOLR-2001
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.4.1
 Environment: http://localhost:8983/solr/select/?q=
Reporter: Sebb
 Fix For: 4.0

 Attachments: SOLR-2001.patch


 {code}
 null
 java.lang.NullPointerException
   at java.io.StringReader.<init>(StringReader.java:33)
   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197)
   at 
 org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
   at org.apache.solr.search.QParser.getQuery(QParser.java:131)
   at 
 org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89)
   at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
   at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
   at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at 
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at 
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at 
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at 
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at 
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at 
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at 
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at 
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at 
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
 RequestURI=/solr/select/
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2640) add LuceneTestCase[J4].newField

2010-09-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908371#action_12908371
 ] 

Uwe Schindler commented on LUCENE-2640:
---

In my opinion, TestBackwardsCompatibility should not add random things in its 
document creation, as the zip files should be reproducible. If there are any 
random parts in it from previous patches, we should remove them.

I would revert the changes and any previous randomization in the parts that are 
responsible for the zip file creation.

 add LuceneTestCase[J4].newField
 ---

 Key: LUCENE-2640
 URL: https://issues.apache.org/jira/browse/LUCENE-2640
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2640.patch


 I think it would be good to vary the different field options in tests.
 For example, we do this with IW settings (newIndexWriterConfig), and 
 directories (newDirectory).
 This patch adds newField(); it works just like new Field(), except it will 
 sometimes turn on extra options:
 Stored fields, term vectors, additional term vectors data, etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2640) add LuceneTestCase[J4].newField

2010-09-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908372#action_12908372
 ] 

Robert Muir commented on LUCENE-2640:
-

bq. I would revert the changes and any previous randomization in the parts that 
are responsible for the zip file creation.

Uwe, good catch. Really though, the parts of the test that modify the index 
once opened should be randomized.

Only createIndex() should have no randomization... I'll fix this.

 add LuceneTestCase[J4].newField
 ---

 Key: LUCENE-2640
 URL: https://issues.apache.org/jira/browse/LUCENE-2640
 Project: Lucene - Java
  Issue Type: Test
  Components: Tests
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2640.patch


 I think it would be good to vary the different field options in tests.
 For example, we do this with IW settings (newIndexWriterConfig), and 
 directories (newDirectory).
 This patch adds newField(); it works just like new Field(), except it will 
 sometimes turn on extra options:
 Stored fields, term vectors, additional term vectors data, etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-11 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2575:
-

Attachment: LUCENE-2575.patch

Here's a start at concurrency, the terms dictionary, and
iterating over doc ids. 

* It needs concurrency unit tests

* At an as yet undetermined interval, we need to conglomerate
the existing terms into a sorted int[] rather than continue to
use the ConcurrentSkipListMap, which consumes a far greater
amount of RAM. The tradeoff and reason for using the CSLM is the
level of concurrency gained by using it at the cost of greater
memory consumption when compared with the sorted int[] of term
ids.

* An int[]-based term enum needs to be implemented. In addition,
a multi-term enum; maybe there's one we can use, I'm not
familiar enough with the new flex code base.

* Copy-on-write is used to obtain a read-only version of the
ByteBlockPool and IntBlockPool. In the case of the byte blocks,
a boolean[] marks which elements need to be copied prior to
writing by the DocumentsWriterPerThread on byte-slice forwarding
address rewrite (see the sketch after this list).

* A write lock on each DWPT guarantees that as reference copies
are made, arrays being copied will not be altered in flight.
There shouldn't be an issue: even though to get a complete
IndexReader[] we need to wait for each document to finish
flushing, we're not blocking indexing, only the obtaining of the
IRs. I can't see this being an issue for most use cases.

* Similarly, a reference is copied of the ParallelPostingsArray
(rather than a full copy) for use by the RAM-buffer-based
IndexReader. It is OK for the PPA to be changed during future doc
adds, as only the elements greater than the IR's max term id
will be altered, i.e., we're not going to run into JMM threading
issues, because the writing and read-only array reference copies
occur in a reentrant lock.

* Recycling of byte[]s becomes a bit more complex as RAM IRs will
likely hold references to them. When the RAM IR is closed, however,
the byte[]s can be recycled. The user could experience unusual
RAM usage spikes if IRs are not closed properly.
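
Roughly, the copy-on-write marking described in the list above might look
like this (a heavily simplified sketch with hypothetical names, not the
patch's actual code):

import java.util.Arrays;

final class CopyOnWriteBlocks {
    private final byte[][] blocks;
    private final boolean[] shared; // true = some RAM IR references it

    CopyOnWriteBlocks(byte[][] blocks) {
        this.blocks = blocks;
        this.shared = new boolean[blocks.length];
    }

    // Reader snapshot: reference-copy the outer array only; the
    // underlying byte[] blocks stay shared with the writer.
    synchronized byte[][] snapshot() {
        Arrays.fill(shared, true);
        return blocks.clone();
    }

    // Writer side: before mutating a block, copy it if a snapshot
    // still references the original.
    synchronized byte[] writableBlock(int index) {
        if (shared[index]) {
            blocks[index] = blocks[index].clone();
            shared[index] = false;
        }
        return blocks[index];
    }
}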



 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-11 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2575:
-

Attachment: LUCENE-2575.patch

This includes a basic implementation of the sorted-term-id based
term enum. We'll want to over-allocate the sorted term id array
so that future merges of new term ids will not require
allocating a new array for growth. I think overall the RAM
buffer based searching will not require too much more of a RAM
outlay. The merging of new term ids could occur in a background
thread if we think it's expensive; however, for now we can simply
merge them in on demand as new RAM readers are created.

Seek is implemented as a binary search of the sorted term ids.
If this is not efficient enough, we can implement a terms index
ala the current system.
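
A minimal sketch of that seek (hypothetical names; plain int term ids stand
in for whatever the real comparator orders):

import java.util.Arrays;

final class SortedTermIdSeek {
    private final int[] termIds; // sorted ascending, over-allocated
    private final int count;     // number of slots actually in use

    SortedTermIdSeek(int[] termIds, int count) {
        this.termIds = termIds;
        this.count = count;
    }

    // Index of the first used entry >= target, or 'count' if every
    // used entry is smaller: a plain binary search over the used
    // prefix of the over-allocated array.
    int seek(int target) {
        int pos = Arrays.binarySearch(termIds, 0, count, target);
        return pos >= 0 ? pos : -pos - 1;
    }
}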

For now the conversion from CSLM to sorted term id array can be
triggered at a percentage of the total number of terms, which I'll
default to 10%. We may want to make this a function (e.g., a
percentage) of RAM consumption in the future.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: IndexReader Cache - a different angle

2010-09-11 Thread Shai Erera
Hey Simon,

You're right that the application can develop a caching mechanism outside
Lucene and, when reopen() is called, if the reader changed, iterate over the
sub-readers and init the Cache w/ the new ones.

However, by building something like that inside Lucene, the application will
get more native support, and thus better performance, in some cases. For
example, consider a field fileType with 10 possible values, and for the sake
of simplicity, let's say that the index is divided evenly across them. Your
users always add such a term constraint to the query (e.g. they want to get
results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
others). You have basically two ways of supporting this:
(1) Add such a term to the query / a clause to a BooleanQuery w/ an AND
relation -- the con is that this term / posting is read for every query.

(2) Write a Filter which works at the top IR level, and that is refreshed
whenever the index is refreshed. This is better than (1); however, it has some
disadvantages:

(2.1) As Mike already proved (on some issue whose subject/number I don't
remember at the moment), if we could get Filter down to the lower-level
components of Lucene's search, so that e.g. it is used as the deleted-docs
OBS, we could get better performance w/ Filters.

(2.2) The Filter is refreshed for the entire IR, and not just the changed
segments. The reason is that, outside Collector, you have no way of telling
IndexSearcher to use Filter F1 for segment S1 and F2 for segment S2.
Loading/refreshing the Filter may be expensive, and definitely won't perform
well w/ NRT, where by definition you'd like to get small changes searchable
very fast.
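
A minimal sketch of per-segment filter caching in that spirit (hypothetical
names against the 3.x API; Lucene's own CachingWrapperFilter caches per
reader along broadly similar lines):

import java.io.IOException;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.OpenBitSet;

abstract class PerSegmentCachedFilter {
    // One bit set per segment reader; entries for closed readers
    // fall away with the WeakHashMap.
    private final Map<IndexReader, OpenBitSet> cache =
        new WeakHashMap<IndexReader, OpenBitSet>();

    // The expensive part: compute matching docs for one segment.
    protected abstract OpenBitSet compute(IndexReader segmentReader)
        throws IOException;

    // After a reopen, unchanged segments hit the cache and only the
    // new/changed segments are recomputed.
    public synchronized OpenBitSet getDocIdSet(IndexReader segmentReader)
            throws IOException {
        OpenBitSet bits = cache.get(segmentReader);
        if (bits == null) {
            bits = compute(segmentReader);
            cache.put(segmentReader, bits);
        }
        return bits;
    }
}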

Therefore I think that if we could provide the necessary hooks in Lucene,
let's call it a Cache plug-in for now, we can incrementally improve the
search process. I don't want to go too far into the design of a generic
plug-ins mechanism, but you're right (again :)) -- we could offer a
reopen(PluginProvider) which is entirely not about Cache, and Cache would
become one of the Plugins the PluginProvider provides. I just try to learn
from past experience -- when the discussion is focused, there's a better
chance of getting to a resolution. However, if you think that in this case a
more generic API, such as PluginProvider, would get us to a resolution faster, I
don't mind spending some time to think about it. But for all practical
purposes, we should IMO start w/ a Cache plug-in, that is called like that,
and if it catches on, generify it ...

Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
so I can't comment on how feasible that solution is. I'll take your word for
it that it's doable :). But this doesn't give us a 3x solution ... the
Caching framework on trunk can be developed w/ Codecs.

Shai

On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer 
simon.willna...@googlemail.com wrote:

 Hi Shai,

 On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  Lucene's Caches have been heavily discussed before (e.g., LUCENE-831,
  LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
  many proposals to attack this problem, w/ no developed solution.

 I didn't go through those issues so forgive me if something I bring up
 has already been discussed.
 I have a couple of questions about your proposal - please find them
 inline...

 
  I'd like to explore a different, IMO much simpler, angle to attack this
  problem. Instead of having Lucene manage the Cache itself, we let the
  application manage it, however Lucene will provide the necessary hooks
  in IndexReader to allow it. The hooks I have in mind are:
 
  (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc. --
  already exists.
 
  (2) When reopen() is called, Lucene will take care to call a
  Cache.load(IndexReader), so that the application can pull whatever
  information
  it needs from the passed-in IndexReader.
 Would that do anything else than passing the new reader (if reopened)
 to the cache's load method? I wonder if this is more than

   if (newReader != oldReader)
       cache.load(newReader);

 If so, something like that should be done on a segment reader anyway,
 right? From my perspective this isn't more than a callback or visitor
 that should walk through the subreaders and be called for each reopened
 sub-reader. A cache-warming visitor / callback would then be trivial
 and the API would be more general.


  So to be more concrete on my proposal, I'd like to support caching in
  the following way (and while I've spent some time thinking about it, I'm
  sure there are great suggestions to improve it):
 
  * Application provides a CacheFactory to IndexReader.open/reopen, which
  exposes some very simple API, such as createCache, or
  initCache(IndexReader) etc. Something which returns a Cache object,
  which does not have very strict/concrete API.

 My first question would be: why should the reader know about the Cache if
 there is no strict / concrete API?
 I can follow you with