Re: Number of Boolean Clauses (AND vs OR)

2011-04-12 Thread Doug Cutting
On 04/12/2011 08:53 AM, Yang wrote:
 when you say executed in parallel, could you please elaborate more
 on what execute refers to?

I mean that the matches for a query clause are generally not all enumerated
before its containing query begins enumerating results.  There's
generally a single thread of execution, so it's not parallel in that sense.

Doug


Re: Number of Boolean Clauses (AND vs OR)

2011-04-11 Thread Doug Cutting
On 04/11/2011 01:25 PM, entdeveloper wrote:
 Thanks Otis. And by your answer, does this mean that individual clauses in a
 boolean query are executed sequentially? not in parallel?

Clauses are executed in parallel.  The execution of a conjunction is
able to efficiently skip occurrences in ranges of documents that do not
contain all clauses, while, for a disjunction, every occurrence of every
clause must be considered.
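
To illustrate, here is a sketch of that leap-frog idea over two sorted
posting lists.  It is only an illustration of the principle, not Lucene's
actual scorer code:

import java.util.ArrayList;
import java.util.List;

// Illustrative only: leap-frog intersection of two sorted posting lists.
// A conjunction works on this principle, skipping over ranges of documents
// that cannot match, while a disjunction must visit every posting of every
// clause.
public class ConjunctionSketch {

  static List<Integer> intersect(int[] a, int[] b) {
    List<Integer> hits = new ArrayList<Integer>();
    int i = 0, j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) {          // both clauses match this doc
        hits.add(a[i]);
        i++; j++;
      } else if (a[i] < b[j]) {    // advance the lagging clause; in Lucene
        i++;                       // this is where skip lists pay off
      } else {
        j++;
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    int[] apple  = {2, 5, 9, 14, 200};
    int[] orange = {5, 14, 77, 200};
    System.out.println(intersect(apple, orange)); // prints [5, 14, 200]
  }
}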

Doug


Re: [PMC] [DISCUSS] Lucy

2010-06-15 Thread Doug Cutting

On 06/14/2010 05:16 PM, Chris Hostetter wrote:

I'm having a hard time wrapping my head around the idea of Lucy moving
from the Lucene TLP to the Incubator TLP.  It probably would have made
sense for Lucy to start in the Incubator years ago, but I'm not really sure
what value that would add now given the status quo.  Would the Incubator
TLP be in a better position to help build the Lucy community than the
Lucene TLP?  Would it have more visibility than it does now?  Would the
Incubator PMC even *accept* Lucy since there are no IP issues, and the
existing committers/community are already ASF committers/community?


I think the question to ask first is whether the Lucene PMC is willing 
to continue incubating Lucy.  The Lucene PMC must either commit to 
providing some oversight, teaching Lucy the Apache way and guiding it to 
graduation as a TLP, or hand control to someone else who is willing, 
like the Incubator PMC.  If the majority of the PMC feel this is not a 
task they care to take on, then a move to the Incubator might make 
sense.  But if the majority feels they can provide oversight for the 
period of time required (6 more months?) then there's no reason to 
change things.


The anti-umbrella rule isn't hard-and-fast.  Umbrellas have often led 
to neglected projects and confusion, so they're generally discouraged. 
The rule of thumb is one-committer-community, one-project, but it's a 
rule of thumb.  If a PMC can demonstrate that it's on top of its 
subprojects, whether mature or incubating, the board should not object. 
 The board might however periodically inquire, e.g., if quarterly 
reports don't provide adequate evidence that subprojects are healthy and 
well monitored.


As long as Lucy is its subproject, the Lucene PMC has a responsibility 
to ensure that Lucy is nurtured.  This is not a huge task, but it's 
still a task.  Are you up for it or not?


Doug


Re: Less drastic ways

2010-03-16 Thread Doug Cutting

Grant Ingersoll wrote:

As
you've seen from the Board's indication, their view is simply that there
should be a single Lucene project.  One committership, one project.


Or, two committerships, two projects.  So, if the existing committer 
structure were acceptable, then Solr would be split to a separate TLP. 
That is the more common direction, for things to be split into separate 
TLPs as they grow.  Merging is an unusual experiment.


Doug


Re: [VOTE] merge lucene/solr development

2010-03-04 Thread Doug Cutting

Uwe Schindler wrote:

One idea: If we really make solr depend on the new lucene lib, solr
should not have lucene jars in its lib folder, but instead the
nightly build should fetch the jars from the lucene hudson build.


If Solr trunk is meant to always be based on Lucene trunk, then 
shouldn't they both be under a single trunk?  A Solr release would then 
not include an independent version of Lucene, but would rather be an 
alternate avenue for releasing the Lucene code.  One could tag, branch 
and release them separately or together.  If separately, then a Solr 
release tag might include a version of Lucene that's never released 
independently.


Doug


Re: Q re merging dev MLs

2010-03-04 Thread Doug Cutting

Otis Gospodnetic wrote:

Didn't we see Hadoop do
exactly the opposite?  Doug described it as code-base being too big,
but I'd say the original list was also too high traffic (was split
into common-, hdfs-, and mapreduce- I believe).


The intent is that eventually HDFS and MapReduce will evolve 
independently.  If a new MapReduce feature depends on a new HDFS 
feature, then MapReduce would have to wait until HDFS releases before it 
can release.  Over time the expectation is that this won't happen much. 
 MapReduce currently develops against a nightly snapshot build of 
HDFS and releases will be synchronized for a while yet, but it may 
someday switch to developing against released versions of HDFS.  So the 
direction there is the opposite of what's been proposed here: 
introducing layered project dependencies rather than reducing them. 
It's too soon to say how well that split will work, just as I think it's 
 difficult to guess how well merging Lucene and Solr will fare.  From 
experience, we know the downsides of Lucene and Solr being split, now we 
may have the opportunity to learn the downsides of being merged!


Doug


Re: [VOTE] merge lucene/solr development

2010-03-04 Thread Doug Cutting

Mattmann, Chris A (388J) wrote:

Doesn't this move towards having a shared code base,


Yes.  The desire seems to be to have a shared code base, no?


and more so the criteria for that of a TLP? Thoughts?


A shared codebase and a single pool of committers are canonical TLP 
attributes, if that's what you mean.


Doug


Re: Boosting on *unique* term matches without using MUST

2010-03-02 Thread Doug Cutting

This question probably belongs on java-user@, not gene...@.

That said, coord() might be what you're looking for:

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/Similarity.html#coord%28int,%20int%29
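
For instance, one might subclass DefaultSimilarity to emphasize coord() and
damp tf().  This is only an untested sketch against the 3.x Similarity API;
the class name and constants are made up:

import org.apache.lucene.search.DefaultSimilarity;

// Illustrative sketch only: reward matching more unique query terms and
// damp repeated occurrences of a single term.  The constants are arbitrary.
public class UniqueTermSimilarity extends DefaultSimilarity {
  @Override
  public float coord(int overlap, int maxOverlap) {
    float fraction = (float) overlap / maxOverlap;
    return fraction * fraction;            // steeper than the default ratio
  }

  @Override
  public float tf(float freq) {
    return (float) (1.0 + Math.log(freq)); // weaker than the default sqrt
  }
}

// installed via: searcher.setSimilarity(new UniqueTermSimilarity());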

Doug

tavi.nathanson wrote:

Hey everyone,

Let me start with an example query: [apple orange banana]

I would like to heavily boost documents containing a greater number of
unique query terms (apple, orange, banana), without MUST'ing the terms; in
other words, a document containing just 2 unique terms (apple, banana)
should have a higher score than a document containing 10 or 20 of the same
term (10 apples). I'm using SHOULD right now, and TF is defeating me;
documents containing a ton of the *same* term are overpowering documents
with a few unique terms.

Is there a standard way to accomplish what I'm looking for? I can think of
several hacks, but I don't really like them:
- I can do a union of query with MUST and a query with SHOULD, and boost the
MUST part, but that doesn't help me with a document that contains apple and
banana (but not orange).
- Perhaps I could lower the impact of TF (although I'm not sure what the
best way of doing this would be).

Thanks so much!


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-03-01 Thread Doug Cutting

Ted Dunning wrote:

Hadoop is a strange beast.  The Hadoop core itself has fractured into three
projects that have independent mailing lists but which share release dates.


But without any releases yet.  Is that shared nothing?

The rationale for the Hadoop split was that the single codebase was too 
big and too active for developers to easily follow.  Splitting dev lists 
was an initial step towards someday splitting into separate TLPs.  The 
first post-split releases will be sync'd, but long term the expectation 
is that the release schedules may diverge.  (This is all my opinion: I 
have but one vote on the Hadoop PMC.)


Doug


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-02-24 Thread Doug Cutting

Michael McCandless wrote:

I think, in order to stop duplicating our analysis code across
Nutch/Solr/Lucene, we should separate out the analyzers into a
standalone package, and maybe as its own sub-project under the Lucene
tlp?


Is the goal to release these on a separate schedule from Lucene Java? 
If so, then this makes sense; if not, then perhaps this could simply be 
a separate source code tree in Lucene Java, built as separate jars.


Where would the analyzer APIs live, in the core or in the analyzer tree? 
 My guess is that they'd live in the core, and that the analyzer tree 
would depend on the core, but one might do it the other way around if 
one felt there were non-Lucene uses for analyzers.


Note that subprojects with different committer lists are an anti-pattern 
at Apache.  We've long done this in Lucene, but have recently been asked 
by the board to consider breaking most subprojects into their own TLPs. 
 Would analyzers someday make sense as an independent TLP?  If not, 
then a subproject with disjoint committers might not be the right pattern.


Doug


Re: [DISCUSS] Archive Lucy

2009-03-09 Thread Doug Cutting

Grant Ingersoll wrote:

I'd _suggest_ a few other things beyond just code commits, however:


+1 for all these.


Finally, six months or so does sound like the right time frame.


So we can consider this fair warning: if there have only been a few 
token commits, and there isn't activity consistent with building a 
community, then we mothball.


Doug


Re: Lucene Meetup in Amsterdam?

2009-02-26 Thread Doug Cutting

FWIW, I can make Tuesday but not Monday.

Doug


Re: Solr graduates and joins Lucene as sub-project

2007-01-17 Thread Doug Cutting

Yonik Seeley wrote:

Solr has just graduated from the Incubator, and has been accepted as a
Lucene sub-project!


Congratulations and welcome!

Doug


New Technology Seeks To Let Startups Build Their Own Googles - Yahoo! News

2006-11-22 Thread Doug Cutting

Some good press for several Lucene projects!

http://news.yahoo.com/s/cmp/20061122/tc_cmp/195600041

Doug


Re: [PROPOSAL] index server project

2006-10-30 Thread Doug Cutting

Yonik Seeley wrote:

On 10/18/06, Doug Cutting [EMAIL PROTECTED] wrote:

We assume that, within an index, a file with a given name is written
only once.


Is this necessary, and will we need the lockless patch (that avoids
renaming or rewriting *any* files), or is Lucene's current index
behavior sufficient?


It's not strictly required, but it would make index synchronization a 
lot simpler.  Yes, I was assuming the lockless patch would be committed 
to Lucene before this project gets very far.  Something more than that 
would be required in order to keep old versions, but this could be as 
simple as a Directory subclass that refuses to remove files for a time.
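
For illustration, here is roughly what such a wrapper might look like.  Note
that it leans on the FilterDirectory delegating base class that only appeared
in much later Lucene releases; with the Lucene of 2006 one would delegate each
Directory method by hand, and the grace-period policy here is arbitrary:

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;

// Illustrative only: hold deletion requests for a grace period so that
// older index versions stay readable while slaves synchronize.
public class DeferredDeleteDirectory extends FilterDirectory {
  private final long graceMillis;
  private final Map<String, Long> pending = new ConcurrentHashMap<String, Long>();

  public DeferredDeleteDirectory(Directory in, long graceMillis) {
    super(in);
    this.graceMillis = graceMillis;
  }

  @Override
  public void deleteFile(String name) throws IOException {
    // remember the request instead of deleting immediately
    pending.put(name, Long.valueOf(System.currentTimeMillis() + graceMillis));
    reapExpired();
  }

  private void reapExpired() throws IOException {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : pending.entrySet()) {
      if (e.getValue().longValue() <= now) {
        super.deleteFile(e.getKey());   // actually remove the file
        pending.remove(e.getKey());
      }
    }
  }
}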



The search side seems straightforward enough, but I haven't totally
figured out how the update side should work.


The master should be out of the loop as much as possible.  One approach 
is that clients randomly assign documents to indexes and send the 
updates directly to the indexing node.  Alternately, clients might index 
locally, then ship the updates to a node packaged as an index.  That was 
the intent of the addIndex method.



One potential problem is a document overwrite implemented as a delete
then an add.
More than one client doing this for the same document could result in
0 or 2 documents, instead of 1.  I guess clients will just need to be
relatively coordinated in their activities.


Good point.  Either the two clients must coordinate, to make sure that 
they're not updating the same document at the same time, or we must use 
a strategy where updates are routed to the slave that contains the old 
version of the document.  That would require a broadcast query to figure 
out which slave that is.



It's unfortunate the master needs to be involved on every document add.


That should not normally be the case.  Clients can cache the set of 
writable index locations and directly submit new documents without 
involving the master.



If deletes were broadcast, and documents could go to any partition,
that would be one way around it (with the downside of a less powerful
master that could implement certain distribution policies).
Another way to lessen the master-in-the-middle cost is to make sure
one can aggregate small requests:
   IndexLocation[] getUpdateableIndex(String[] id);


I'd assumed that the updateable version of an index does not move around 
very often.  Perhaps a lease mechanism is required.  For example, a call 
to getUpdateableIndex might be valid for ten minutes.
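
To make that concrete, here is a minimal client-side sketch of such a lease.
Only getUpdateableIndex and the IndexLocation type come from the proposed
interfaces below; the class name is made up:

// Illustrative only: cache the updateable index location returned by the
// master and ask again once the lease expires.
public class UpdateableIndexLease {
  private final ClientToMasterProtocol master;
  private final String indexId;
  private final long leaseMillis;      // e.g. ten minutes

  private IndexLocation cached;
  private long expiresAt;

  public UpdateableIndexLease(ClientToMasterProtocol master,
                              String indexId, long leaseMillis) {
    this.master = master;
    this.indexId = indexId;
    this.leaseMillis = leaseMillis;
  }

  public synchronized IndexLocation get() {
    long now = System.currentTimeMillis();
    if (cached == null || now >= expiresAt) {
      cached = master.getUpdateableIndex(indexId);  // rare call to the master
      expiresAt = now + leaseMillis;
    }
    return cached;  // all adds during the lease go straight to this slave
  }
}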


We might consider a delete() on the master interface too.  That way it 
could

1) hide the delete policy (broadcast or direct-to-server-that-has-doc)
2) potentially do some batching of deletes
3) simply do the delete locally if there is a single index partition
and this is a combination master/searcher


I'm reluctant to put any frequently-made call on the master.  I'd prefer 
to keep the master only involved at an executive level, with all 
per-document and per-query traffic going directly from client to slave.



It seems like the master might want to be involved in commits too, or
maybe we just rely on the slave to master heartbeat to kick of
immediately after a commit so that index replication can be initiated?


I like the latter approach.  New versions are only published as 
frequently as clients poll the master for updated IndexLocations. 
Clients keep a cache of both readable and updatable index locations that 
are periodically refreshed.


I was not imagining a real-time system, where the next query after a 
document is added would always include that document.  Is that a 
requirement?  That's harder.


At this point I'm mostly trying to see if this functionality would meet 
the needs of Solr, Nutch and others.


Must we include a notion of document identity and/or document version in 
the mechanism?  Would that facilitate updates and coherency?


In Nutch a typical case is that you have a bunch of URLs with content 
that may-or-may-not have been previously indexed.  The approach I'm 
currently leaning towards is that we'd broadcast the deletions of all of 
these to all slaves, then add them to randomly assigned indexes. 
In Nutch multiple clients would naturally be coordinated, since each URL 
is represented only once in each update cycle.


Doug


[PROPOSAL] index server project

2006-10-18 Thread Doug Cutting
It seems that Nutch and Solr would benefit from a shared index serving 
infrastructure.  Other Lucene-based projects might also benefit from 
this.  So perhaps we should start a new project to build such a thing. 
This could start either in java/contrib, or as a separate sub-project, 
depending on interest.


Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably 
Hadoop's).  The system would be configured with a single master node 
that keeps track of where indexes are located, and a number of slave 
nodes that would maintain, search and replicate indexes.  Clients would 
talk to the master to find out which indexes to search or update, then 
talk directly to slaves to perform searches and updates.


Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written 
only once.  Index versions are sets of files, and a new version of an 
index is likely to share most files with the prior version.  Versions 
are numbered.  An index server should keep old versions of each index 
for a while, not immediately removing old files.


public class IndexVersion {
  String id;   // unique name of the index
  int version; // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);

  // batch update
  void addIndex(String index, IndexLocation indexToAdd);

  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}

The master thus maintains the set of indexes that are available for 
search, keeps track of which slave should handle changes to an index and 
initiates index synchronization between slaves.  The master can be 
configured to replicate indexes a specified number of times.


The client library can cache the current set of searchable indexes and 
periodically refresh it.  Searches are broadcast to one index for each 
id, and the results are merged.  The client will load-balance both 
searches and updates.


Deletions could be broadcast to all slaves.  That would probably be fast 
enough.  Alternately, indexes could be partitioned by a hash of each 
document's unique id, permitting deletions to be routed to the 
appropriate slave.
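
The routing itself would be trivial, something like the following fragment
(the helper name is made up):

// Illustrative only: choose the slave partition that owns a document's
// unique id, so deletions and updates for that id always land in one place.
static int partitionFor(String uniqueId, int numPartitions) {
  return (uniqueId.hashCode() & Integer.MAX_VALUE) % numPartitions;
}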


Does this make sense?  Does it sound like it would be useful to Solr? 
To Nutch?  To others?  Who would be interested and able to work on it?


Doug


Re: Infrastructure for large Lucene index

2006-10-06 Thread Doug Cutting

James wrote:

Let me check with the powers that be
here, and then get the code into a more polished form.  We hope to have it
really enterprise-ready over the next couple months.


Great!  Once you have permission, please post it sooner rather than 
later, then others can help with polishing, or at least be informed by 
your methods.  What I'd hate to happen is for you to get permission but 
never have time to polish it, and hence never contribute it.  An 
unpolished patch is better than no patch at all.


Thanks!

Doug


Re: apachecon

2006-09-28 Thread Doug Cutting

Chris Hostetter wrote:

Spamming general so people on other subprojects besides java-user see
this...

http://wiki.apache.org/apachecon/BirdsOfaFeatherUs06


I'll be there!

Doug


Re: IndexWriter and IndexReader open at the same time

2005-08-09 Thread Doug Cutting

Greg Love wrote:

In the TheServerSide case study of the book, page 375, they say that
they close the IndexWriter and even point out that they did so before
opening the IndexReader and deleting. So that kinda makes me wonder
if I'm safe having an IndexReader with deletions and an IndexWriter
with inserts open at the same time (even though my code never does an
index-modifying operation at the same time, because they share a lock
in my code).


You should close the IndexReader you are using for deletions before 
opening the IndexWriter you use for additions.  For higher throughput, 
queue additions and deletions and process them periodically as batches. 
 If you're concerned about an addition followed by a deletion of the 
same document getting reversed in the queues, then simply check the 
addition queue each time you queue a deletion, and remove any matching 
additions.
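
A rough sketch of that queueing scheme follows.  It is illustrative only:
the class and method names are made up, and it is written against the later
deleteDocuments() name (in Lucene 1.4 the equivalent call is
IndexReader.delete(Term)):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Illustrative only: buffer additions and deletions so they can be applied
// periodically as batches.  A later deletion cancels any queued addition of
// the same document, matched on the document's unique key field.
public class BatchedUpdater {
  private final List<Document> pendingAdds = new ArrayList<Document>();
  private final List<Term> pendingDeletes = new ArrayList<Term>();

  public synchronized void queueAdd(Document doc) {
    pendingAdds.add(doc);
  }

  public synchronized void queueDelete(Term key) {
    // drop any not-yet-flushed addition of the same document
    for (Iterator<Document> it = pendingAdds.iterator(); it.hasNext(); ) {
      if (key.text().equals(it.next().get(key.field()))) {
        it.remove();
      }
    }
    pendingDeletes.add(key);
  }

  // The caller opens the deleting IndexReader, applies and clears the
  // deletions, closes it, and only then opens the IndexWriter for the adds.
  public synchronized void applyDeletes(IndexReader reader) throws IOException {
    for (Term t : pendingDeletes) {
      reader.deleteDocuments(t);
    }
    pendingDeletes.clear();
  }

  public synchronized void applyAdds(IndexWriter writer) throws IOException {
    for (Document d : pendingAdds) {
      writer.addDocument(d);
    }
    pendingAdds.clear();
  }
}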


Doug