Batch Solr Server

2013-09-06 Thread gaoagong
Does anyone know if there is such a thing as a BatchSolrServer object in the
solrj code? I am currently using ConcurrentUpdateSolrServer, but it isn't
doing quite what I expected. It distributes the load of sending through the
HTTP client across different threads and manages the connections, but it
does not package the documents into bundles. This can be done manually by
calling solrServer.add(Collection<SolrInputDocument> documents), which
creates an UpdateRequest object for the entire collection. When
ConcurrentUpdateSolrServer gets to this UpdateRequest it sends all of the
documents together in a single HTTP call.

What I want to be able to do is call solrServer.add(SolrInputDocument
document) and have the SolrServer grab the next batch (up to a specified
size) and then create an UpdateRequest. This would reduce the number of
individual requests the Solr servers have to handle, as well as any
per-HTTP-call overhead incurred.

Would this kind of functionality be worthwhile to anyone else? Should I
create such a SolrServer object?
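For discussion's sake, the buffering behaviour I have in mind could be
sketched like this (a hypothetical wrapper, with a type parameter standing
in for SolrInputDocument and a plain List standing in for UpdateRequest;
this is not actual solrj code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the proposed batching behaviour: single-document add() calls are
// buffered cheaply, and a batch of up to batchSize documents is drained off
// in one go. In a real SolrServer, nextBatch() would build one UpdateRequest
// per batch instead of returning a List.
class BatchingBuffer<Doc> {
    private final BlockingQueue<Doc> queue = new LinkedBlockingQueue<>();
    private final int batchSize;

    BatchingBuffer(int batchSize) {
        this.batchSize = batchSize;
    }

    // Non-blocking per-document buffering.
    void add(Doc doc) {
        queue.add(doc);
    }

    // Drain up to batchSize buffered documents into one bundle.
    List<Doc> nextBatch() {
        List<Doc> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        return batch;
    }
}
```

The point is that callers keep the simple one-document add() API while the
server amortizes HTTP overhead across whole batches.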



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Batch-Solr-Server-tp4088657.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Cloud hangs when replicating updates

2013-09-06 Thread Kevin Osborn
Thanks a ton, Mark. I have tried SOLR-4816 and it didn't help, but I will
try Mark's patch next week and see what happens.

-Kevin


On Thu, Sep 5, 2013 at 4:46 AM, Erick Erickson wrote:

> If you run into this again, try a jstack trace. You should see
> evidence of being stuck in SolrCmdDistributor on a variable
> called "semaphore"... On current 4x this is around line 420.
>
> If you're using SolrJ, then SOLR-4816 is another thing to try.
>
> But Mark's patch would be the best of all to test. If that doesn't
> fix it, then the jstack suggestion would at least tell us whether it's
> the issue we think it is.
>
> FWIW,
> Erick
>
>
> On Wed, Sep 4, 2013 at 12:51 PM, Mark Miller 
> wrote:
>
> > It would be great if you could give this patch a try:
> > http://pastebin.com/raw.php?i=aaRWwSGP
> >
> > - Mark
> >
> >
> > On Wed, Sep 4, 2013 at 8:31 AM, Kevin Osborn 
> > wrote:
> >
> > > Thanks. If there is anything I can do to help you resolve this issue,
> let
> > > me know.
> > >
> > > -Kevin
> > >
> > >
> > > On Wed, Sep 4, 2013 at 7:51 AM, Mark Miller 
> > wrote:
> > >
> > > > I'll look at fixing the root issue for 4.5. I've been putting it off
> > > > for way too long.
> > > >
> > > > Mark
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Sep 3, 2013, at 2:15 PM, Kevin Osborn 
> > wrote:
> > > >
> > > > > I was having problems updating SolrCloud with a large batch of
> > > > > records. The records are coming in bursts with lulls between updates.
> > > > >
> > > > > At first, I just tried large updates of 100,000 records at a time.
> > > > > Eventually, this caused Solr to hang. When hung, I can still query
> > > > > Solr. But I cannot do any deletes or other updates to the index.
> > > > >
> > > > > At first, my updates were going as SolrJ CSV posts. I have also tried
> > > > > local file updates and had similar results. I finally slowed things
> > > > > down to just use SolrJ's Update feature, which is basically just
> > > > > JavaBin. I am also sending over just 100 at a time in 10 threads.
> > > > > Again, it eventually hung.
> > > > >
> > > > > Sometimes, Solr hangs in the first couple of chunks. Other times, it
> > > > > hangs right away.
> > > > >
> > > > > These are my commit settings:
> > > > >
> > > > > <autoCommit>
> > > > >   <maxTime>15000</maxTime>
> > > > >   <maxDocs>5000</maxDocs>
> > > > >   <openSearcher>false</openSearcher>
> > > > > </autoCommit>
> > > > > <autoSoftCommit>
> > > > >   <maxTime>3</maxTime>
> > > > > </autoSoftCommit>
> > > > >
> > > > > I have tried quite a few variations with the same results. I also
> > > > > tried various JVM settings with the same results. The only thing that
> > > > > seems to help is reducing the cluster size from 2 to 1.
> > > > >
> > > > > I also did a jstack trace. I did not see any explicit deadlocks, but I
> > > > > did see quite a few threads in WAITING or TIMED_WAITING. It is
> > > > > typically something like this:
> > > > >
> > > > >  java.lang.Thread.State: WAITING (parking)
> > > > >at sun.misc.Unsafe.park(Native Method)
> > > > >- parking to wait for  <0x00074039a450> (a
> > > > > java.util.concurrent.Semaphore$NonfairSync)
> > > > >at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > > >at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> > > > >at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> > > > >at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> > > > >at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> > > > >at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> > > > >at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> > > > >at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> > > > >at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> > > > >at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:139)
> > > > >at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:474)
> > > > >at org.apache.solr.handler.loader.CSVLoaderBase.doAdd(CSVLoaderBase.java:395)
> > > > >at org.apache.solr.handler.loader.SingleThreadedCSVLoader.addDoc(CSVLoader.java:44)
> > > > >at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderB
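The frames above show update threads parked in Semaphore.acquire inside
SolrCmdDistributor's request throttling. As a rough illustration of that
pattern (a simplified sketch under my own naming, not the actual Solr code):

```java
import java.util.concurrent.Semaphore;

// Simplified sketch of semaphore-throttled request submission: each submit
// must acquire a permit, and the permit is released only when the request
// completes. If completions stall (e.g. nodes blocked waiting on each
// other's permits), every thread ends up parked in acquire(), which is what
// the stack trace above looks like.
class ThrottledSubmitter {
    private final Semaphore permits;

    ThrottledSubmitter(int maxOutstanding) {
        this.permits = new Semaphore(maxOutstanding);
    }

    void submit(Runnable request) {
        permits.acquireUninterruptibly(); // parks when maxOutstanding requests are in flight
        try {
            request.run();                // in Solr this would be an HTTP call to another node
        } finally {
            permits.release();            // released on completion
        }
    }

    int available() {
        return permits.availablePermits();
    }
}
```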

Re: solrcloud shards backup/restoration

2013-09-06 Thread Tim Vaillancourt
I wouldn't say I love this idea, but wouldn't it be safe to LVM snapshot
the Solr index? I think this may even work on a live server, depending on
some file I/O details. Has anyone tried this?

An in-Solr solution sounds more elegant, but considering the tlog concern
Shalin mentioned, I think this may work as an interim solution.

Cheers!

Tim


On 6 September 2013 15:41, Aditya Sakhuja  wrote:

> Thanks Shalin and Mark for your responses. I am on the same page about the
> conventions for taking the backup. However, I am less sure about the
> restoration of the index. Let's say we have 3 shards across 3 solrcloud
> servers.
>
> 1.> I am assuming we should take a backup from each of the shard leaders
> to get a complete collection. Do you think that will get the complete
> index (not worrying about what is not hard committed at the time of
> backup)?
>
> 2.> How do we go about restoring the index in a fresh solrcloud cluster?
> From the structure of the snapshot I took, I did not see any
> replication.properties or index.properties, which I normally see on
> healthy solrcloud cluster nodes. If I have a snapshot named
> snapshot.20130905, does snapshot.20130905/* go into data/index?
>
> Thanks
> Aditya
>
>
>
> On Fri, Sep 6, 2013 at 7:28 AM, Mark Miller  wrote:
>
> > Phone typing. The end should not say "don't hard commit" - it should say
> > "do a hard commit and take a snapshot".
> >
> > Mark
> >
> > Sent from my iPhone
> >
> > On Sep 6, 2013, at 7:26 AM, Mark Miller  wrote:
> >
> > > I don't know that it's too bad though - its always been the case that
> if
> > you do a backup while indexing, it's just going to get up to the last
> hard
> > commit. With SolrCloud that will still be the case. So just make sure you
> > do a hard commit right before taking the backup - yes, it might miss a
> few
> > docs in the tran log, but if you are taking a back up while indexing, you
> > don't have great precision in any case - you will roughly get a snapshot
> > for around that time - even without SolrCloud, if you are worried about
> > precision and getting every update into that backup, you want to stop
> > indexing and commit first. But if you just want a rough snapshot for
> around
> > that time, in both cases you can still just don't hard commit and take a
> > snapshot.
> > >
> > > Mark
> > >
> > > Sent from my iPhone
> > >
> > > On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar <
> > shalinman...@gmail.com> wrote:
> > >
> > >> The replication handler's backup command was built for pre-SolrCloud.
> > >> It takes a snapshot of the index but it is unaware of the transaction
> > >> log which is a key component in SolrCloud. Hence unless you stop
> > >> updates, commit your changes and then take a backup, you will likely
> > >> miss some updates.
> > >>
> > >> That being said, I'm curious to see how peer sync behaves when you try
> > >> to restore from a snapshot. When you say that you haven't been
> > >> successful in restoring, what exactly is the behaviour you observed?
> > >>
> > >> On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja <
> > aditya.sakh...@gmail.com> wrote:
> > >>> Hello,
> > >>>
> > >>> I was looking for a good backup / recovery solution for the solrcloud
> > >>> indexes. I am more looking for restoring the indexes from the index
> > >>> snapshot, which can be taken using the replicationHandler's backup
> > command.
> > >>>
> > >>> I am looking for something that works with solrcloud 4.3 eventually,
> > but
> > >>> still relevant if you tested with a previous version.
> > >>>
> > >>> I haven't been successful in having the restored index replicate across
> > the
> > >>> new replicas, after I restart all the nodes, with one node having the
> > >>> restored index.
> > >>>
> > >>> Is restoring the indexes on all the nodes the best way to do it ?
> > >>> --
> > >>> Regards,
> > >>> -Aditya Sakhuja
> > >>
> > >>
> > >>
> > >> --
> > >> Regards,
> > >> Shalin Shekhar Mangar.
> >
>
>
>
> --
> Regards,
> -Aditya Sakhuja
>


Re: Facet Count and RegexTransformer>splitBy

2013-09-06 Thread Raheel Hasan
Hi,

What I want is very simple:

The "query" results:
row 1 = a,b,c,d
row 2 = a,f,r,e
row 3 = a,c,ff,e,b
..

facet count needed:
'a' = 3 occurrence
'b' = 2 occur.
'c' = 2 occur.
.
.
.


I searched and found a solution here:
http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values

But I want to be sure if it will work.
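For what it's worth, the usual way this is wired up (the field, entity, and
table names below are made-up placeholders) is to split the CSV at import
time and facet on an untokenized multiValued field:

```xml
<!-- schema.xml: an untokenized, multiValued field so each CSV value is a
     separate facet term -->
<field name="tags" type="string" indexed="true" stored="true"
       multiValued="true"/>

<!-- data-config.xml: RegexTransformer splits the column on commas, giving
     one value per entry -->
<entity name="item" transformer="RegexTransformer"
        query="SELECT id, tags FROM items">
  <field column="tags" splitBy=","/>
</entity>
```

With that in place, querying with facet=true&facet.field=tags should return
the per-value counts described above.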



On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky wrote:

> Facet counts are per field - your counts are scattered across different
> fields.
>
> There are additional capabilities in the facet component, but first you
> should describe exactly what your requirements are.
>
> -- Jack Krupansky
> -Original Message- From: Raheel Hasan
> Sent: Friday, September 06, 2013 9:58 AM
> To: solr-user@lucene.apache.org
> Subject: Facet Count and RegexTransformer>splitBy
>
>
> Hi guyz,
>
> Just a quick question:
>
> I have a field that has CSV values in the database. So I will use the
> DataImportHandler and will index it using RegexTransformer's splitBy
> attribute. However, since this is the first time I am doing it, I just
> wanted to be sure if it will work for Facet Count?
>
> For example:
> From "query" results (say this is the values in that field):
> row 1 = 1,2,3,4
> row 2 = 1,4,5,3
> row 3 = 2,1,20,66
> .
> .
> .
> .
> so facet count will get me:
> '1' = 3 occurrence
> '2' = 2 occur.
> .
> .
> .and so on.
>
>
>
>
>
> --
> Regards,
> Raheel Hasan
>



-- 
Regards,
Raheel Hasan


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Fermin Silva
Hi,

Our schema is identical except for the version attribute:
in 3.x it's 1.1 and in 4.x it's 1.5.

Also, in solrconfig.xml we have no luceneMatchVersion for 3.x (so it's using
LUCENE_24, I believe), and in 4.x we set it to LUCENE_44.

Thanks
On Sep 6, 2013 3:34 PM, "Chris Hostetter"  wrote:

>
> : I'm migrating from 3.x to 4.x and I'm running some queries to verify that
> : everything works like before. I've found however that the query "galaxy
> s3"
> : is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.
>
> is your entire schema 100% identical in both cases?
> what is the luceneMatchVersion set to in your solrconfig.xml?
>
>
> By the looks of your debug output, it appears that you are using
> autoGeneratePhraseQueries="true" in 3x, but have it set to false in 4x --
> but the fieldType you posted here shows it set to false
>
> :  : positionIncrementGap="100" autoGeneratePhraseQueries="false">
>
> ...i haven't tried to reproduce your specific situation, but that
> configuration doesn't smell right compared with what you are showing for
> the 3x output...
>
> : SOLR 3.x
> :
> : +(title_search_pt:galaxy
> : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
> : 3")
> :
> : SOLR 4.x
> :
> : +((title_search_pt:galaxy
> : title_search_pt:galax)/no_coord) +(+title_search_pt:sii
> : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)
>
>
> -Hoss
>
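For reference, the two settings Hoss asks about live in the config files
roughly like this (values per Fermin's description; the fieldType name and
its analyzer body are omitted/assumed here):

```xml
<!-- solrconfig.xml: pins analysis and query-parsing back-compat behaviour -->
<luceneMatchVersion>LUCENE_44</luceneMatchVersion>

<!-- schema.xml: autoGeneratePhraseQueries is the attribute whose 3.x vs 4.x
     behaviour the parsed-query output above suggests differs -->
<fieldType name="text_pt" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  ...
</fieldType>
```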


Unknown attribute id in add:allowDups

2013-09-06 Thread Brian Robinson

Hello,
I'm working with the Pecl package, with Solr 4.3.1. I have a doc defined 
in my schema where id is the uniqueKey,


<field name="id" type="int" indexed="true" stored="true" required="true"
multiValued="false" />

<uniqueKey>id</uniqueKey>

I tried to add a doc to my index with the following code (simplified for 
the question):


$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', 12345);
$doc->addField('description', 'This is the content of the doc');
$updateResponse = $client->addDocument($doc);

When I do this, the doc is not added to the index, and I get the
following error in the logs in the admin UI:


 Unknown attribute id in add:allowDups

However, I noticed that if I change the field to type string:

<field name="id" type="string" indexed="true" stored="true"
required="true" multiValued="false" />

...
$doc->addField('id', '12345');

the doc is added to the index, but I still get the error in the log.

So first, I was wondering, is there some other way I should be setting 
this up so that id can be an int instead of a string?


And then I was also wondering what this error is referring to. Is there 
some further way I need to define id? Or maybe define the uniqueKey 
differently?


Any help would be much appreciated.
Thanks,
Brian
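For context, the error text matches the legacy XML update syntax: the
client library is evidently still sending the old pre-Solr-1.0-style
allowDups attribute on the add element, roughly:

```xml
<!-- legacy update message shape; Solr 4 expects overwrite="true|false"
     on <add> rather than the old allowDups attribute -->
<add allowDups="false">
  <doc>
    <field name="id">12345</field>
    <field name="description">This is the content of the doc</field>
  </doc>
</add>
```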


Re: solrcloud shards backup/restoration

2013-09-06 Thread Aditya Sakhuja
Thanks Shalin and Mark for your responses. I am on the same page about the
conventions for taking the backup. However, I am less sure about the
restoration of the index. Let's say we have 3 shards across 3 solrcloud
servers.

1.> I am assuming we should take a backup from each of the shard leaders to
get a complete collection. Do you think that will get the complete index
(not worrying about what is not hard committed at the time of backup)?

2.> How do we go about restoring the index in a fresh solrcloud cluster?
From the structure of the snapshot I took, I did not see any
replication.properties or index.properties, which I normally see on
healthy solrcloud cluster nodes. If I have a snapshot named
snapshot.20130905, does snapshot.20130905/* go into data/index?

Thanks
Aditya



On Fri, Sep 6, 2013 at 7:28 AM, Mark Miller  wrote:

> Phone typing. The end should not say "don't hard commit" - it should say
> "do a hard commit and take a snapshot".
>
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 7:26 AM, Mark Miller  wrote:
>
> > I don't know that it's too bad though - its always been the case that if
> you do a backup while indexing, it's just going to get up to the last hard
> commit. With SolrCloud that will still be the case. So just make sure you
> do a hard commit right before taking the backup - yes, it might miss a few
> docs in the tran log, but if you are taking a back up while indexing, you
> don't have great precision in any case - you will roughly get a snapshot
> for around that time - even without SolrCloud, if you are worried about
> precision and getting every update into that backup, you want to stop
> indexing and commit first. But if you just want a rough snapshot for around
> that time, in both cases you can still just don't hard commit and take a
> snapshot.
> >
> > Mark
> >
> > Sent from my iPhone
> >
> > On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
> >
> >> The replication handler's backup command was built for pre-SolrCloud.
> >> It takes a snapshot of the index but it is unaware of the transaction
> >> log which is a key component in SolrCloud. Hence unless you stop
> >> updates, commit your changes and then take a backup, you will likely
> >> miss some updates.
> >>
> >> That being said, I'm curious to see how peer sync behaves when you try
> >> to restore from a snapshot. When you say that you haven't been
> >> successful in restoring, what exactly is the behaviour you observed?
> >>
> >> On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja <
> aditya.sakh...@gmail.com> wrote:
> >>> Hello,
> >>>
> >>> I was looking for a good backup / recovery solution for the solrcloud
> >>> indexes. I am more looking for restoring the indexes from the index
> >>> snapshot, which can be taken using the replicationHandler's backup
> command.
> >>>
> >>> I am looking for something that works with solrcloud 4.3 eventually,
> but
> >>> still relevant if you tested with a previous version.
> >>>
> >>> I haven't been successful in having the restored index replicate across
> the
> >>> new replicas, after I restart all the nodes, with one node having the
> >>> restored index.
> >>>
> >>> Is restoring the indexes on all the nodes the best way to do it ?
> >>> --
> >>> Regards,
> >>> -Aditya Sakhuja
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
>



-- 
Regards,
-Aditya Sakhuja


Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
Enjoy your trip, Mark! Thanks again for the help!

Tim

On 6 September 2013 14:18, Mark Miller  wrote:

> Okay, thanks, useful info. Getting on a plane, but I'll look more at this
> soon. That 10k thread spike is good to know - that's no good and could
> easily be part of the problem. We want to keep that from happening.
>
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt  wrote:
>
> > Hey Mark,
> >
> > The farthest we've made it at the same batch size/volume was 12 hours
> > without this patch, but that isn't consistent. Sometimes we would only
> get
> > to 6 hours or less.
> >
> > During the crash I can see an amazing spike in threads to 10k which is
> > essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
> > cannot open native thread errors" that always follow this. Weird!
> >
> > We also notice a spike in CPU around the crash. The instability caused
> some
> > shard recovery/replication though, so that CPU may be a symptom of the
> > replication, or is possibly the root cause. The CPU spikes from about
> > 20-30% utilization (system + user) to 60% fairly sharply, so the CPU,
> while
> > spiking isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons,
> whole
> > index is in 128GB RAM, 6xRAID10 15k).
> >
> > More on resources: our disk I/O seemed to spike about 2x during the crash
> > (about 1300kbps written to 3500kbps), but this may have been the
> > replication, or ERROR logging (we generally log nothing due to
> > WARN-severity unless something breaks).
> >
> > Lastly, I found this stack trace occurring frequently, and have no idea
> > what it is (may be useful or not):
> >
> > "java.lang.IllegalStateException :
> >  at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >  at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >  at
> >
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >  at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >  at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >  at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >  at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >  at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >  at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >  at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >  at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >  at
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >  at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >  at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >  at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >  at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >  at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >  at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >  at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >  at
> >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >  at
> >
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >  at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >  at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >  at java.lang.Thread.run(Thread.java:724)"
> >
> > On your live_nodes question, I don't have historical data on this from
> when
> > the crash occurred, which I guess is what you're looking for. I could add
> > this to our monitoring for future tests, however. I'd be glad to continue
> > further testing, but I think first more monitoring is needed to
> understand
> > this further. Could we come up with a list of metrics that would be
> useful
> > to see following another test and successful crash?
> >
> > Metrics needed:
> >
> > 1) # of live_nodes.
> > 2) Full stack traces.
> > 3) CPU used by Solr's JVM specifically (instead of system-wide).
> > 4) Solr's JVM thread count (already done)
> > 5) ?
> >
> > Cheers,
> >
> > Tim Vaillancourt
> >
> >
> > On 6 September 2013 13:11, Mark Miller  wrote:
> >
> >> Did you ever get to index that long before without hitting the deadlock?
> >>
> >> There really isn't anything negative the patch could be introducing,
> other
> >> than allowing for some mor

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours
without this patch, but that isn't consistent. Sometimes we would only get
to 6 hours or less.

During the crash I can see an amazing spike in threads to 10k which is
essentially our ulimit for the JVM, but I strangely see none of the
"OutOfMemory: cannot open native thread" errors that always follow this. Weird!

We also notice a spike in CPU around the crash. The instability caused some
shard recovery/replication though, so that CPU may be a symptom of the
replication, or is possibly the root cause. The CPU spikes from about
20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while
spiking, isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons, whole
index is in 128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash
(about 1300kbps written to 3500kbps), but this may have been the
replication, or ERROR logging (we generally log nothing due to
WARN-severity unless something breaks).

Lastly, I found this stack trace occurring frequently, and have no idea
what it is (may be useful or not):

"java.lang.IllegalStateException :
  at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
  at org.eclipse.jetty.server.Response.sendError(Response.java:325)
  at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
  at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
  at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
  at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
  at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
  at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
  at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
  at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
  at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
  at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
  at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
  at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
  at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
  at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
  at org.eclipse.jetty.server.Server.handle(Server.java:445)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
  at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
  at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
  at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
  at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
  at java.lang.Thread.run(Thread.java:724)"

On your live_nodes question, I don't have historical data on this from when
the crash occurred, which I guess is what you're looking for. I could add
this to our monitoring for future tests, however. I'd be glad to continue
further testing, but I think first more monitoring is needed to understand
this further. Could we come up with a list of metrics that would be useful
to see following another test and successful crash?

Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done)
5) ?

Cheers,

Tim Vaillancourt


On 6 September 2013 13:11, Mark Miller  wrote:

> Did you ever get to index that long before without hitting the deadlock?
>
> There really isn't anything negative the patch could be introducing, other
> than allowing for some more threads to possibly run at once. If I had to
> guess, I would say its likely this patch fixes the deadlock issue and your
> seeing another issue - which looks like the system cannot keep up with the
> requests or something for some reason - perhaps due to some OS networking
> settings or something (more guessing). Connection refused happens generally
> when there is nothing listening on the port.
>
> Do you see anything interesting change with the rest of the system? CPU
> usage spikes or something like that?
>
> Clamping down further on the overall number of threads might help (which
> would require making something configurable). How many nodes are listed in
> zk under live_nodes?
>
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt 
> wrote:
>
> > Hey guys,
> >
> > (copy 

Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-06 Thread Nutan
It shows the type as undefined for the dynamic field ignored_*, and I am
using the default collection1 core, but on the admin page the schema shows:
 







 





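The schema XML above was stripped by the list archive; for comparison, the
stock Solr example schema declares the catch-all ignored_* field along
these lines:

```xml
<!-- example schema.xml: values sent to ignored_* fields are silently
     dropped (neither indexed nor stored) -->
<fieldtype name="ignored" stored="false" indexed="false"
           multiValued="true" class="solr.StrField"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```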
--
View this message in context: 
http://lucene.472066.n3.nabble.com/unknown-stream-source-info-while-indexing-rich-doc-in-solr-tp4088136p4088591.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Odd behavior after adding an additional core.

2013-09-06 Thread mike st. john
hi,

curl '
http://192.168.0.1:8983/solr/admin/collections?action=CREATE&name=collectionx&numShards=4&replicationFactor=1&collection.configName=config1
'

after that, I added approx 100k documents and verified they were in the
index and distributed across the shards.


i then decided to start adding some replicas via coreadmin.

curl '
http://192.168.0.1:8983/solr/admin/cores?action=CREATE&name=collectionx_ex_replica1&collection=collectionx&collection.configName=config1
'


adding the core produced the following: it took away leader status from
the leader of the shard it was replicating, inserted itself as down, and
changed the doc routing to implicit.


Thanks.



On Fri, Sep 6, 2013 at 4:24 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Can you give exact steps to reproduce this problem?
>
> Also, are you sure you supplied numShards=4 while creating the collection?
>
> On Fri, Sep 6, 2013 at 12:20 AM, mike st. john  wrote:
> > using solr 4.4, i used the collection admin to create a collection with
> > 4 shards and a replicationFactor of 1.
> >
> > i did this so i could index my data, then bring in replicas later by
> > adding cores via coreadmin.
> >
> >
> > i added a new core via coreadmin. what i noticed shortly after adding the
> > core: the leader of the shard where the new replica was placed was marked
> > active, the new core was marked as the leader, and the routing was now
> > set to implicit.
> >
> >
> >
> > i've replicated this on another solr setup as well.
> >
> >
> > Any ideas?
> >
> >
> > Thanks
> >
> > msj
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Mark Miller
Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon.
That 10k thread spike is good to know - that's no good and could easily be part
of the problem. We want to keep that from happening.

Mark

Sent from my iPhone

On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt  wrote:

> Hey Mark,
> 
> The farthest we've made it at the same batch size/volume was 12 hours
> without this patch, but that isn't consistent. Sometimes we would only get
> to 6 hours or less.
> 
> During the crash I can see an amazing spike in threads to 10k which is
> essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
> cannot open native thread errors" that always follow this. Weird!
> 
> We also notice a spike in CPU around the crash. The instability caused some
> shard recovery/replication though, so that CPU may be a symptom of the
> replication, or is possibly the root cause. The CPU spikes from about
> 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while
> spiking isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons, whole
> index is in 128GB RAM, 6xRAID10 15k).
> 
> More on resources: our disk I/O seemed to spike about 2x during the crash
> (about 1300kbps written to 3500kbps), but this may have been the
> replication, or ERROR logging (we generally log nothing due to
> WARN-severity unless something breaks).
> 
> Lastly, I found this stack trace occurring frequently, and have no idea
> what it is (may be useful or not):
> 
> "java.lang.IllegalStateException :
>  at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
>  at org.eclipse.jetty.server.Response.sendError(Response.java:325)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>  at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
>  at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
>  at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>  at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>  at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>  at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
>  at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
>  at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>  at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
>  at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>  at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
>  at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>  at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>  at org.eclipse.jetty.server.Server.handle(Server.java:445)
>  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
>  at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
>  at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>  at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
>  at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
>  at java.lang.Thread.run(Thread.java:724)"
> 
> On your live_nodes question, I don't have historical data on this from when
> the crash occurred, which I guess is what you're looking for. I could add
> this to our monitoring for future tests, however. I'd be glad to continue
> further testing, but I think first more monitoring is needed to understand
> this further. Could we come up with a list of metrics that would be useful
> to see following another test and successful crash?
> 
> Metrics needed:
> 
> 1) # of live_nodes.
> 2) Full stack traces.
> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> 4) Solr's JVM thread count (already done)
> 5) ?
> 
> Cheers,
> 
> Tim Vaillancourt
> 
> 
> On 6 September 2013 13:11, Mark Miller  wrote:
> 
>> Did you ever get to index that long before without hitting the deadlock?
>> 
>> There really isn't anything negative the patch could be introducing, other
>> than allowing for some more threads to possibly run at once. If I had to
>> guess, I would say it's likely this patch fixes the deadlock issue and
>> you're seeing another issue - which looks like the system cannot keep up
>> with the requests or something for some reason - perhaps due to some OS
>> networking settings or something (more guessing). Connection refused
>> happens generally when there is nothing listening on the port.

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Mark Miller
Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other than 
allowing for some more threads to possibly run at once. If I had to guess, I 
would say it's likely this patch fixes the deadlock issue and you're seeing 
another issue - which looks like the system cannot keep up with the requests or 
something for some reason - perhaps due to some OS networking settings or 
something (more guessing). Connection refused happens generally when there is 
nothing listening on the port. 

Do you see anything interesting change with the rest of the system? CPU usage 
spikes or something like that?

Clamping down further on the overall number of threads might help (which would 
require making something configurable). How many nodes are listed in zk under 
live_nodes?

Mark

Sent from my iPhone

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt  wrote:

> Hey guys,
> 
> (copy of my post to SOLR-5216)
> 
> We tested this patch and unfortunately encountered some serious issues after
> a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
> writing about 5000 docs/sec total, using autoCommit to commit the updates
> (no explicit commits).
> 
> Our environment:
> 
>Solr 4.3.1 w/SOLR-5216 patch.
>Jetty 9, Java 1.7.
>3 solr instances, 1 per physical server.
>1 collection.
>3 shards.
>2 replicas (each instance is a leader and a replica).
>Soft autoCommit is 1000ms.
>Hard autoCommit is 15000ms.
> 
> After about 6 hours of stress-testing this patch, we see many of these
> stalled transactions (below), and the Solr instances start to see each
> other as down, flooding our Solr logs with "Connection Refused" exceptions,
> and otherwise no obviously-useful logs that I could see.
> 
> I did notice some stalled transactions on both /select and /update,
> however. This never occurred without this patch.
> 
> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> 
> Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
> My script "normalizes" the ERROR-severity stack traces and returns them in
> order of occurrence.
> 
> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> 
> Thanks!
> 
> Tim Vaillancourt
> 
> 
> On 6 September 2013 07:27, Markus Jelsma  wrote:
> 
>> Thanks!
>> 
>> -Original message-
>>> From:Erick Erickson 
>>> Sent: Friday 6th September 2013 16:20
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>> 
>>> Markus:
>>> 
>>> See: https://issues.apache.org/jira/browse/SOLR-5216
>>> 
>>> 
>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
>>> wrote:
>>> 
 Hi Mark,
 
 Got an issue to watch?
 
 Thanks,
 Markus
 
 -Original message-
> From:Mark Miller 
> Sent: Wednesday 4th September 2013 16:55
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud 4.x hangs under high update volume
> 
> I'm going to try and fix the root cause for 4.5 - I've suspected
>> what it
 is since early this year, but it's never personally been an issue, so
>> it's
 rolled along for a long time.
> 
> Mark
> 
> Sent from my iPhone
> 
> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt 
 wrote:
> 
>> Hey guys,
>> 
>> I am looking into an issue we've been having with SolrCloud since
>> the
>> beginning of our testing, all the way from 4.1 to 4.3 (haven't
>> tested
 4.4.0
>> yet). I've noticed other users with this same issue, so I'd really
 like to
>> get to the bottom of it.
>> 
>> Under a very, very high rate of updates (2000+/sec), after 1-12
>> hours
 we
>> see stalled transactions that snowball to consume all Jetty
>> threads in
 the
>> JVM. This eventually causes the JVM to hang with most threads
>> waiting
 on
>> the condition/stack provided at the bottom of this message. At this
 point
>> SolrCloud instances then start to see their neighbors (who also
>> have
 all
>> threads hung) as down w/"Connection Refused", and the shards become
 "down"
>> in state. Sometimes a node or two survives and just returns 503s
>> "no
 server
>> hosting shard" errors.
>> 
>> As a workaround/experiment, we have tuned the number of threads
>> sending
>> updates to Solr, as well as the batch size (we batch updates from
 client ->
>> solr), and the Soft/Hard autoCommits, all to no avail. Turning off
>> Client-to-Solr batching (1 update = 1 call to Solr), which also
>> did not
>> help. Certain combinations of update threads and batch sizes seem
>> to
>> mask/help the problem, but not resolve it entirely.
>> 
>> Our current environment is the following:
>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > > - 3 x Zookeeper instances, external Java 7 JVM.

collections api setting dataDir

2013-09-06 Thread mike st. john
is there any way to change the dataDir while creating a collection via the
collection api?


RE: Solr 4.3 Startup with Multiple Cores Hangs on "Registering Core"

2013-09-06 Thread Austin Rasmussen
: Do all of your cores have "newSearcher" event listeners configured or just
: 2 (i'm trying to figure out if it's a timing fluke that these two are 
stalled, or if it's something special about the configs)

All of my cores have both the "newSearcher" and "firstSearcher" event listeners 
configured. (The firstSearcher actually doesn't have any queries configured 
against it, so it probably should just be removed altogether)

: Can you try removing the newSearcher listeners to confirm that that does in 
fact make the problem go away?

Removing the "newSearcher" listeners does not make the problem go away; 
however, removing the "firstSearcher" listener (even if the "newSearcher" 
listener is still configured) does make the problem go away.

: With the newSearcher listeners in place, can you try setting 
"spellcheck=false" as a query param on the newSearcher listeners you have 
configured and 
: see if that works around the problem?

Adding the "spellcheck=false" param to the "firstSearcher" listener does appear 
to work around the problem.
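
In case it helps anyone else hitting this, the workaround amounts to something like the following in solrconfig.xml (a sketch only -- the warming query "q" value here is a made-up placeholder, not our actual one):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- placeholder warming query -->
      <str name="q">solr</str>
      <!-- disable the spellcheck component during warming -->
      <str name="spellcheck">false</str>
    </lst>
  </arr>
</listener>
```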

: Assuming it's just 2 cores using these listeners: can you reproduce this 
problem with a simpler setup where only one of the affected cores is in use?

Since it's not just these two cores, I'm not sure how to produce much of a 
simpler setup.  I did attempt to limit how many cores are loaded in the 
solr.xml, and found that if I cut it down to 56, it was able to load 
successfully (without any of the above config changed).

If I cut it down to 57 cores, it doesn't hang at "registering core" any more, 
it actually gets as far as " QuerySenderListener sending requests to 
Searcher@2f28849 main{StandardDirectoryReader(..."

If 58+ cores are loaded at start up, that's when it begins to hang at 
"registering core".  However, it always hangs on the *last* core configured in 
the solr.xml, regardless of how many cores are being loaded.


: can you reproduce using Solr 4.4?
: It would be helpful if you could create a jira and attach...
: * your complete configs -- or at least some configs similar to yours that are 
complete enough to reproduce the startup problem.  
: * some sample data (based on
: your initial description, i'm guessing there at least needs to be a handful 
: of docs in the index -- and most likely they need to match your warming query 
: -- but we don't need your actual indexes, just some docs that will work with 
your configs that we can index 
: & restart to see the problem. 
: * these thread dumps.

I can likely get to this early next week, both checking into how this behaves 
using Solr 4.4 and submitting a JIRA with your requested info.


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Shawn Heisey

On 9/6/2013 12:46 PM, Fermin Silva wrote:

Our schema is identical except the version.
In 3.x it's 1.1 and in 4.x it's 1.5.

Also in solrconfig.xml we have no lucene version for 3.x (so it's using 2_4
i believe) and in 4.x we fixed it to 4_4.


The autoGeneratePhraseQueries parameter didn't exist before schema 
version 1.4.


I'm fairly sure that for your schema that is at version 1.1, the 
autoGeneratePhraseQueries value specified in the field definition will 
be ignored and the actual value that gets used will be "true", which 
goes along with what Hoss has said.


See the comment about the version in the example schema on any 4.x Solr 
download.


Thanks,
Shawn



Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Chris Hostetter

: Our schema is identical except the version.
: In 3.x it's 1.1 and in 4.x it's 1.5.

That's kind of a significant difference to leave out -- independent of the 
question you are asking about here, it's going to make quite a few 
differences in how things are being parsed, and in what the defaults are.

If i'm understanding correctly: you like the behavior you are getting from 
Solr 3.x where phrases are generated automatically for you.

what i can't understand is how/why phrases are being generated 
automatically for you if you have that 'autoGeneratePhraseQueries="false"' 
on your fieldType in your 3x schema ... that makes no sense to me.

if you didn't have "autoGeneratePhraseQueries" specified at all, then the 
'version="1.1"' would explain it (up to version=1.3, the default for 
autoGeneratePhraseQueries was true, but in version=1.4 and above, it 
defaults to false) ... but with an explicit 
'autoGeneratePhraseQueries="false"' i can't explain why 3x works the way 
you say it works for you.

Bottom line: if you *want* the auto generated phrase query behavior 
in 4.x, you should just set 'autoGeneratePhraseQueries="true"' on your 
fieldType.
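
For example, something along these lines (a sketch only -- the type name and analyzer chain here are placeholders, not your actual fieldType):

```xml
<fieldType name="text_pt" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- synonym expansion, as in your "siii, s3" / "galaxy, galax" setup -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With autoGeneratePhraseQueries="true", multi-token output from a single query term (such as synonym expansions) is wrapped in a (multi-)phrase query instead of being AND'ed together.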



: > : I'm migrating from 3.x to 4.x and I'm running some queries to verify that
: > : everything works like before. I've found however that the query "galaxy
: > s3"
: > : is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.
: >
: > is your entire schema 100% identical in both cases?
: > what is the luceneMatchVersion set to in your solrconfig.xml?
: >
: >
: > By the looks of your debug output, it appears that you are using
: > autoGeneratePhraseQueries="true" in 3x, but have it set to false in 4x --
: > but the fieldType you posted here shows it set to false
: >
: > :  : positionIncrementGap="100" autoGeneratePhraseQueries="false">
: >
: > ...i haven't tried to reproduce your specific situation, but that
: > configuration doesn't smell right compared with what you are showing for
: > the 3x output...
: >
: > : SOLR 3.x
: > :
: > : +(title_search_pt:galaxy
: > : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
: > : 3")
: > :
: > : SOLR 4.x
: > :
: > : +((title_search_pt:galaxy
: > : title_search_pt:galax)/no_coord) +(+title_search_pt:sii
: > : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str>
: >
: >
: > -Hoss
: >
: 

-Hoss


RE: Solr 4.3 Startup with Multiple Cores Hangs on "Registering Core"

2013-09-06 Thread Chris Hostetter

: Sorry for the multi-post, seems like the .tdump files didn't get 
: attached.  I've tried attaching them as .txt files this time.

Interesting ... it looks like 2 of your cores are blocked in loading while 
waiting for the searchers to open ... not clear if it's a deadlock or why 
though - in both cases the coreLoaderThread is trying to register stuff 
with JMX, which is asking for stats right off the bat (not sure why), 
which requires accessing the searcher and is waiting for that to be 
available.  but then you also have "newSearcher" listener events which 
are using the spellcheck component which is blocked waiting for that 
searcher as well.

Do all of your cores have "newSearcher" event listeners configured or just 
2 (i'm trying to figure out if it's a timing fluke that these two are 
stalled, or if it's something special about the configs)

Can you try removing the newSearcher listeners to confirm that that does in 
fact make the problem go away?

With the newSearcher listeners in place, can you try setting 
"spellcheck=false" as a query param on the newSearcher listeners you have 
configured and see if that works around the problem?

Assuming it's just 2 cores using these listeners: can you reproduce this 
problem with a simpler setup where only one of the affected cores is in 
use?

can you reproduce using Solr 4.4?


It would be helpful if you could create a jira and attach...

* your complete configs -- or at least some configs similar to 
yours that are complete enough to reproduce the startup problem.  
* some sample data (based on 
your initial description, i'm guessing there at least needs to be a 
handful of docs in the index -- and most likely they need to match your 
warming query -- but we don't need your actual indexes, just some docs 
that will work with your configs that we can index & restart to see the 
problem. 
* these thread dumps.


-Hoss


Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after
a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
writing about 5000 docs/sec total, using autoCommit to commit the updates
(no explicit commits).

Our environment:

Solr 4.3.1 w/SOLR-5216 patch.
Jetty 9, Java 1.7.
3 solr instances, 1 per physical server.
1 collection.
3 shards.
2 replicas (each instance is a leader and a replica).
Soft autoCommit is 1000ms.
Hard autoCommit is 15000ms.
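
(For reference, those commit settings correspond to a solrconfig.xml block roughly like this -- a sketch, not our exact config; openSearcher=false for the hard commit is an assumption here:)

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit every 15s: flushes to disk, doesn't open a searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit every 1s: makes updates visible to searches -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```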

After about 6 hours of stress-testing this patch, we see many of these
stalled transactions (below), and the Solr instances start to see each
other as down, flooding our Solr logs with "Connection Refused" exceptions,
and otherwise no obviously-useful logs that I could see.

I did notice some stalled transactions on both /select and /update,
however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
My script "normalizes" the ERROR-severity stack traces and returns them in
order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma  wrote:

> Thanks!
>
> -Original message-
> > From:Erick Erickson 
> > Sent: Friday 6th September 2013 16:20
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud 4.x hangs under high update volume
> >
> > Markus:
> >
> > See: https://issues.apache.org/jira/browse/SOLR-5216
> >
> >
> > On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> > wrote:
> >
> > > Hi Mark,
> > >
> > > Got an issue to watch?
> > >
> > > Thanks,
> > > Markus
> > >
> > > -Original message-
> > > > From:Mark Miller 
> > > > Sent: Wednesday 4th September 2013 16:55
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > > >
> > > > I'm going to try and fix the root cause for 4.5 - I've suspected
> what it
> > > is since early this year, but it's never personally been an issue, so
> it's
> > > rolled along for a long time.
> > > >
> > > > Mark
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt 
> > > wrote:
> > > >
> > > > > Hey guys,
> > > > >
> > > > > I am looking into an issue we've been having with SolrCloud since
> the
> > > > > beginning of our testing, all the way from 4.1 to 4.3 (haven't
> tested
> > > 4.4.0
> > > > > yet). I've noticed other users with this same issue, so I'd really
> > > like to
> > > > > get to the bottom of it.
> > > > >
> > > > > Under a very, very high rate of updates (2000+/sec), after 1-12
> hours
> > > we
> > > > > see stalled transactions that snowball to consume all Jetty
> threads in
> > > the
> > > > > JVM. This eventually causes the JVM to hang with most threads
> waiting
> > > on
> > > > > the condition/stack provided at the bottom of this message. At this
> > > point
> > > > > SolrCloud instances then start to see their neighbors (who also
> have
> > > all
> > > > > threads hung) as down w/"Connection Refused", and the shards become
> > > "down"
> > > > > in state. Sometimes a node or two survives and just returns 503s
> "no
> > > server
> > > > > hosting shard" errors.
> > > > >
> > > > > As a workaround/experiment, we have tuned the number of threads
> sending
> > > > > updates to Solr, as well as the batch size (we batch updates from
> > > client ->
> > > > > solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> > > > > Client-to-Solr batching (1 update = 1 call to Solr), which also
> did not
> > > > > help. Certain combinations of update threads and batch sizes seem
> to
> > > > > mask/help the problem, but not resolve it entirely.
> > > > >
> > > > > Our current environment is the following:
> > > > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
> shard
> > > and
> > > > > a replica of 1 shard).
> > > > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
> on a
> > > good
> > > > > day.
> > > > > - 5000 max jetty threads (well above what we use when we are
> healthy),
> > > > > Linux-user threads ulimit is 6000.
> > > > > - Occurs under Jetty 8 or 9 (many versions).
> > > > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > > > - Occurs under several JVM tunings.
> > > > > - Everything seems to point to Solr itself, and not a Jetty or Java
> > > version
> > > > > (I hope I'm wrong).
> > > > >
> > > > > The stack trace that is holding up all my Jetty QTP threads is the
> > > > > following, which seems to be waiting on a lock that I would very
> much
> > > like
> > > > > to understand further:
> > > > >
> > > > > "java.lang.Thread.State: WAITING (

Re: CRLF Invalid Exception ?

2013-09-06 Thread Chris Hostetter

: I'm not sure if this means there's a bug in the client library I'm using
: (solrj 4.3) or is a bug in the server SOLR 4.3?  Or is there something in
: my data that's causing the issue?

It's unlikely that an error in the data you pass to SolrJ methods would be 
causing this problem -- i'm pretty sure it's not even a problem with the 
raw xml data being streamed; it appears to be a problem with how that data 
is getting chunked across the wire.

My best guess is that the most likely causes are either...
 * a bug in the HttpClient version you are using on the client side
 * a bug in the ChunkedInputFilter you are using on the server side
 * a misconfiguration on the HttpClient object you are using with SolrJ
   (ie: claiming it's sending chunked when it's not?)


-Hoss


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Fermin Silva
Whether or not we like the behaviour we are getting in 3.x, I'm required to
keep everything working as closely to before as possible.

I have no idea why this is happening, but setting that attribute to true
solved the issue; now I get exactly the same number of items in both queries!

I wouldn't bother checking why that was so since we'll be moving away from
the older version, which shows the inconsistency.

But thanks a million.

If you have a SO user I can mark yours as answer here:
http://stackoverflow.com/questions/18661996/solr-4-x-vs-3-x-parsedquery-differences

Cheers
On Sep 6, 2013 4:15 PM, "Chris Hostetter"  wrote:

>
> : Our schema is identical except the version.
> : In 3.x it's 1.1 and in 4.x it's 1.5.
>
> That's kind of a significant difference to leave out -- indepenent of the
> question you are asking about here, it's going to make quite a few
> differences in how things are being being parsed, and what defaults are.
>
> If i'm understanding correctly: you like the behavior you are getting from
> Solr 3.x where phrases are generated automatically for you.
>
> what i can't understand, is how/why phrases are being generated
> automatically for you if you have that 'autoGeneratePhraseQueries="false"'
> on your fieldType in your 3x schema ... that makes no sense to me.
>
> if you didn't have "autoGeneratePhraseQueries" specified at all, then the
> 'version="1.1"' would explain it (up to version=1.3, the default for
> autoGeneratePhraseQueries was true, but in version=1.4 and above, it
> defaults to false)  but with an explicit
> 'autoGeneratePhraseQueries="false"' i can't explain why 3x works the way
> you say it works for you.
>
> Bottom line: if you *want* the auto generated phrase query behavior
> in 4.x, you should just set 'autoGeneratePhraseQueries="true"' on your
> fieldType.
>
>
>
> : > : I'm migrating from 3.x to 4.x and I'm running some queries to verify
> that
> : > : everything works like before. I've found however that the query
> "galaxy
> : > s3"
> : > : is giving much less results. In 3.x numFound=1628, in 4.x
> numFound=70.
> : >
> : > is your entire schema 100% identical in both cases?
> : > what is the luceneMatchVersion set to in your solrconfig.xml?
> : >
> : >
> : > By the looks of your debug output, it appears that you are using
> : > autoGeneratePhraseQueries="true" in 3x, but have it set to false in 4x
> --
> : > but the fieldType you posted here shows it set to false
> : >
> : > :  : > : positionIncrementGap="100" autoGeneratePhraseQueries="false">
> : >
> : > ...i haven't tried to reproduce your specific situation, but that
> : > configuration doesn't smell right compared with what you are showing
> for
> : > the 3x output...
> : >
> : > : SOLR 3.x
> : > :
> : > : +(title_search_pt:galaxy
> : > : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
> : > : 3")
> : > :
> : > : SOLR 4.x
> : > :
> : > : +((title_search_pt:galaxy
> : > : title_search_pt:galax)/no_coord) +(+title_search_pt:sii
> : > : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str>
> : >
> : >
> : > -Hoss
> : >
> :
>
> -Hoss
>


Re: CRLF Invalid Exception ?

2013-09-06 Thread Brent Ryan
For what it's worth... I just updated to solrj 4.4 (even though my server
is solr 4.3) and it seems to have fixed the issue.

Thanks for the help!


On Fri, Sep 6, 2013 at 1:41 PM, Chris Hostetter wrote:

>
> : I'm not sure if this means there's a bug in the client library I'm using
> : (solrj 4.3) or is a bug in the server SOLR 4.3?  Or is there something in
> : my data that's causing the issue?
>
> It's unlikly that an error in the data you pass to SolrJ methods would be
> causing this problem -- i'm pretty sure it's not even a problem with the
> raw xml data being streamed, it appears to be a problem with how that data
> is getting shunked across the wire.
>
> My best guess is that the most likely causes are either...
>  * a bug in the HttpClient versio you are using on the client side
>  * a bug in the ChunkedInputFilter you are using on the server side
>  * a misconfiguration on the HttpClient object you are using with SolrJ
>(ie: claiming it's sending chunked when it's not?)
>
>
> -Hoss
>


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Chris Hostetter

: I'm migrating from 3.x to 4.x and I'm running some queries to verify that
: everything works like before. I've found however that the query "galaxy s3"
: is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.

is your entire schema 100% identical in both cases?
what is the luceneMatchVersion set to in your solrconfig.xml?


By the looks of your debug output, it appears that you are using 
autoGeneratePhraseQueries="true" in 3x, but have it set to false in 4x -- 
but the fieldType you posted here shows it set to false

: 

...i haven't tried to reproduce your specific situation, but that 
configuration doesn't smell right compared with what you are showing for 
the 3x output...

: SOLR 3.x
: 
: +(title_search_pt:galaxy
: title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
: 3")
: 
: SOLR 4.x
: 
: +((title_search_pt:galaxy
: title_search_pt:galax)/no_coord) +(+title_search_pt:sii
: +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)


-Hoss


Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-06 Thread Chris Hostetter

: it shows type as undefined for dynamic field ignored_* , and I am using

That means the running solr instance does not know anything about a 
dynamic field named ignored_* -- it doesn't exist.

: but on the admin page it shows schema :

the page showing the schema file just tells you what's on disk -- it has 
no way of knowing if you modified that file after starting up solr.

... Wait a minute ... i see your problem now...

...
:  
: 

...your <dynamicField> declaration needs to be inside your <fields> 
block.
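
For example, roughly the way the stock example schema does it (a sketch from memory -- the "ignored" type name comes from the example schema, yours may differ):

```xml
<fields>
  <!-- ... your other field declarations ... -->
  <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
</fields>
```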


-Hoss


SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Raúl Cardozo
I'm migrating from 3.x to 4.x and I'm running some queries to verify that
everything works like before. I've found however that the query "galaxy s3"
is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.

Here's the relevant schema part:

The synonyms involved in this query are:

siii, s3
galaxy, galax

My default search operator is AND (in both versions, even though it's
deprecated in 4.x), and the output of the debug is:

SOLR 3.x

+(title_search_pt:galaxy
title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
3")

SOLR 4.x

+((title_search_pt:galaxy
title_search_pt:galax)/no_coord) +(+title_search_pt:sii
+title_search_pt:s3 +title_search_pt:s +title_search_pt:3)

The weird thing is that it does not return results like 'galaxy s3'. This
is the debug query:

no match on required clause (+title_search_pt:sii +title_search_pt:s3
+title_search_pt:s +title_search_pt:3)
(NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s), *no
match on required clause (title_search_pt:sii)*
(NON-MATCH) no matching term
(MATCH) weight(title_search_pt:s3 in 1834535)
(MATCH) weight(title_search_pt:s in 1834535)
(MATCH) weight(title_search_pt:3 in 1834535)

How is it that sii is *required* when it should be OR'ed with s and s3?

The analysis output shows that sii has token position 2, like its
synonyms, like so:

galaxy  sii 3
galax   s3
s

Thanks,

Raúl Cardozo.


Re: CRLF Invalid Exception ?

2013-09-06 Thread Chris Hostetter

: Has anyone ever hit this when adding documents to SOLR?  What does it mean?

Always check for the root cause...

: Caused by: java.io.IOException: Invalid CRLF
: 
: at
: 
org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)

...so while Solr is trying to read XML off the InputStream from the 
client, an error is encountered by the ChunkedInputFilter.  

I suspect the client library you are using for the HTTP connection is 
claiming it's using chunking but isn't, or is doing something wrong with 
the chunking, or there is a bug in the ChunkedInputFilter.


-Hoss


Re: Facet Count and RegexTransformer>splitBy

2013-09-06 Thread Raheel Hasan
basically, a field holding a csv... and I want the counts / number of
occurrences of each csv value.


On Fri, Sep 6, 2013 at 8:54 PM, Raheel Hasan wrote:

> Hi,
>
> What I want is very simple:
>
> The "query" results:
> row 1 = a,b,c,d
> row 2 = a,f,r,e
> row 3 = a,c,ff,e,b
> ..
>
> facet count needed:
> 'a' = 3 occurrence
> 'b' = 2 occur.
> 'c' = 2 occur.
> .
> .
> .
>
>
> I searched and found a solution here:
>
> http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values
>
> But I want to be sure if it will work.
>
>
>
> On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky wrote:
>
>> Facet counts are per field - your counts are scattered across different
>> fields.
>>
>> There are additional capabilities in the facet component, but first you
>> should describe exactly what your requirements are.
>>
>> -- Jack Krupansky
>> -Original Message- From: Raheel Hasan
>> Sent: Friday, September 06, 2013 9:58 AM
>> To: solr-user@lucene.apache.org
>> Subject: Facet Count and RegexTransformer>splitBy
>>
>>
>> Hi guyz,
>>
>> Just a quick question:
>>
>> I have a field that has CSV values in the database. So I will use the
>> DataImportHandler and will index it using RegexTransformer's splitBy
>> attribute. However, since this is the first time I am doing it, I just
>> wanted to be sure if it will work for Facet Count?
>>
>> For example:
>> From "query" results (say this is the values in that field):
>> row 1 = 1,2,3,4
>> row 2 = 1,4,5,3
>> row 3 = 2,1,20,66
>> .
>> .
>> .
>> .
>> so facet count will get me:
>> '1' = 3 occurrence
>> '2' = 2 occur.
>> .
>> .
>> .and so on.
>>
>>
>>
>>
>>
>> --
>> Regards,
>> Raheel Hasan
>>
>
>
>
> --
> Regards,
> Raheel Hasan
>



-- 
Regards,
Raheel Hasan


Re: CRLF Invalid Exception ?

2013-09-06 Thread Brent Ryan
Thanks.  I realized there's an error in the ChunkedInputFilter...

I'm not sure if this means there's a bug in the client library I'm using
(SolrJ 4.3) or a bug in the server (Solr 4.3). Or is there something in
my data that's causing the issue?


On Fri, Sep 6, 2013 at 1:02 PM, Chris Hostetter wrote:

>
> : Has anyone ever hit this when adding documents to SOLR?  What does it
> mean?
>
> Always check for the root cause...
>
> : Caused by: java.io.IOException: Invalid CRLF
> :
> : at
> :
> org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)
>
> ...so while Solr is trying to read XML off the InputStream from the
> client, an error is encountered by the ChunkedInputFilter.
>
> I suspect the client library you are using for the HTTP connection is
> claiming it's using chunking but isn't, or is doing something wrong with
> the chunking, or there is a bug in the ChunkedInputFilter.
>
>
> -Hoss
>


Re: Facet Count and RegexTransformer>splitBy

2013-09-06 Thread Raheel Hasan
let me further elaborate:
[db > table1]
field1 = int
field2 = string (solr indexing = true)
field3 = csv

[During import into solr]
splitBy=","

[After import]
solr will be searched for terms from field2.

[needed]
counts of occurrences of each value in the csv



On Fri, Sep 6, 2013 at 9:35 PM, Raheel Hasan wrote:

> Its a csv from the database. I will import it like this, (say for example
> the field is 'emailids' and it contain csv of email ids):
> 
>
>
>
> On Fri, Sep 6, 2013 at 9:01 PM, Jack Krupansky wrote:
>
>> You're not being clear here - are the commas delimiting fields or do you
>> have one value per row?
>>
>> Yes, you can tokenize a comma-delimited value in Solr.
>>
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Raheel Hasan
>> Sent: Friday, September 06, 2013 11:54 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Facet Count and RegexTransformer>splitBy
>>
>>
>> Hi,
>>
>> What I want is very simple:
>>
>> The "query" results:
>> row 1 = a,b,c,d
>> row 2 = a,f,r,e
>> row 3 = a,c,ff,e,b
>> ..
>>
>> facet count needed:
>> 'a' = 3 occurrence
>> 'b' = 2 occur.
>> 'c' = 2 occur.
>> .
>> .
>> .
>>
>>
>> I searched and found a solution here:
>> http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values
>>
>> But I want to be sure if it will work.
>>
>>
>>
>> On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky wrote:
>>
>>  Facet counts are per field - your counts are scattered across different
>>> fields.
>>>
>>> There are additional capabilities in the facet component, but first you
>>> should describe exactly what your requirements are.
>>>
>>> -- Jack Krupansky
>>> -Original Message- From: Raheel Hasan
>>> Sent: Friday, September 06, 2013 9:58 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Facet Count and RegexTransformer>splitBy
>>>
>>>
>>> Hi guyz,
>>>
>>> Just a quick question:
>>>
>>> I have a field that has CSV values in the database. So I will use the
>>> DataImportHandler and will index it using RegexTransformer's splitBy
>>> attribute. However, since this is the first time I am doing it, I just
>>> wanted to be sure if it will work for Facet Count?
>>>
>>> For example:
>>> From "query" results (say this is the values in that field):
>>> row 1 = 1,2,3,4
>>> row 2 = 1,4,5,3
>>> row 3 = 2,1,20,66
>>> .
>>> .
>>> .
>>> .
>>> so facet count will get me:
>>> '1' = 3 occurrence
>>> '2' = 2 occur.
>>> .
>>> .
>>> .and so on.
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Raheel Hasan
>>>
>>>
>>
>>
>> --
>> Regards,
>> Raheel Hasan
>>
>
>
>
> --
> Regards,
> Raheel Hasan
>



-- 
Regards,
Raheel Hasan
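[A minimal data-import sketch of the setup described above; the table and field names are the hypothetical ones from the message, and splitBy turns field3 into multiple values per document:]

```xml
<entity name="table1" transformer="RegexTransformer"
        query="SELECT field1, field2, field3 FROM table1">
  <field column="field1" name="field1"/>
  <field column="field2" name="field2"/>
  <!-- split the CSV column into separate values -->
  <field column="field3" name="field3" splitBy=","/>
</entity>
```

The target field3 in schema.xml must be declared multiValued="true" for the split values to be indexed separately; faceting on it (facet=true&facet.field=field3) then yields the per-value counts.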


Connection Established but waiting for response for a long time.

2013-09-06 Thread qungg
Hi,

I'm running Solr 4.0 but using a legacy distributed search setup. I set the
shards parameter for search, but index into each Solr shard directly.
The problem I have been experiencing is establishing connections with the
shards. If I run a query, using wget, to get the number of records from each
individual shard (50 of them) sequentially, the request will hang at some
shards (seemingly at random). The wget log will say the connection is
established but that it is waiting for a response. At that point I thought the
Solr shard might be under high load, but the strange behavior is that when I
send another request to the same shard (using wget again) from another thread,
the response comes back, and that triggers something in Solr to send back the
response for the first request I sent before.

This also happens in my daily indexing. If I send a commit, it will sometimes
hang. However, if I send another commit to the same shard, both
commits come back fine.

I'm running Solr on the stock Jetty server, and some time back my boss told me
to set maxIdleTime to 5000 for indexing purposes. I'm not sure if this
has anything to do with the strange behavior that I'm seeing right now.

Please help me resolve this issue.

Thanks,
Qun



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Connection-Established-but-waiting-for-response-for-a-long-time-tp4088587.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Facet Count and RegexTransformer>splitBy

2013-09-06 Thread Raheel Hasan
It's a csv field from the database. I will import it like this (say, for example,
the field is 'emailids' and it contains a csv of email ids):




On Fri, Sep 6, 2013 at 9:01 PM, Jack Krupansky wrote:

> You're not being clear here - are the commas delimiting fields or do you
> have one value per row?
>
> Yes, you can tokenize a comma-delimited value in Solr.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Raheel Hasan
> Sent: Friday, September 06, 2013 11:54 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Facet Count and RegexTransformer>splitBy
>
>
> Hi,
>
> What I want is very simple:
>
> The "query" results:
> row 1 = a,b,c,d
> row 2 = a,f,r,e
> row 3 = a,c,ff,e,b
> ..
>
> facet count needed:
> 'a' = 3 occurrence
> 'b' = 2 occur.
> 'c' = 2 occur.
> .
> .
> .
>
>
> I searched and found a solution here:
> http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values
>
> But I want to be sure if it will work.
>
>
>
> On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky wrote:
>
>  Facet counts are per field - your counts are scattered across different
>> fields.
>>
>> There are additional capabilities in the facet component, but first you
>> should describe exactly what your requirements are.
>>
>> -- Jack Krupansky
>> -Original Message- From: Raheel Hasan
>> Sent: Friday, September 06, 2013 9:58 AM
>> To: solr-user@lucene.apache.org
>> Subject: Facet Count and RegexTransformer>splitBy
>>
>>
>> Hi guyz,
>>
>> Just a quick question:
>>
>> I have a field that has CSV values in the database. So I will use the
>> DataImportHandler and will index it using RegexTransformer's splitBy
>> attribute. However, since this is the first time I am doing it, I just
>> wanted to be sure if it will work for Facet Count?
>>
>> For example:
>> From "query" results (say this is the values in that field):
>> row 1 = 1,2,3,4
>> row 2 = 1,4,5,3
>> row 3 = 2,1,20,66
>> .
>> .
>> .
>> .
>> so facet count will get me:
>> '1' = 3 occurrence
>> '2' = 2 occur.
>> .
>> .
>> .and so on.
>>
>>
>>
>>
>>
>> --
>> Regards,
>> Raheel Hasan
>>
>>
>
>
> --
> Regards,
> Raheel Hasan
>



-- 
Regards,
Raheel Hasan


Re: Facet Count and RegexTransformer>splitBy

2013-09-06 Thread Jack Krupansky
You're not being clear here - are the commas delimiting fields or do you 
have one value per row?


Yes, you can tokenize a comma-delimited value in Solr.

-- Jack Krupansky

-Original Message- 
From: Raheel Hasan

Sent: Friday, September 06, 2013 11:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Facet Count and RegexTransformer>splitBy

Hi,

What I want is very simple:

The "query" results:
row 1 = a,b,c,d
row 2 = a,f,r,e
row 3 = a,c,ff,e,b
..

facet count needed:
'a' = 3 occurrence
'b' = 2 occur.
'c' = 2 occur.
.
.
.


I searched and found a solution here:
http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values

But I want to be sure if it will work.



On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky 
wrote:



Facet counts are per field - your counts are scattered across different
fields.

There are additional capabilities in the facet component, but first you
should describe exactly what your requirements are.

-- Jack Krupansky
-Original Message- From: Raheel Hasan
Sent: Friday, September 06, 2013 9:58 AM
To: solr-user@lucene.apache.org
Subject: Facet Count and RegexTransformer>splitBy


Hi guyz,

Just a quick question:

I have a field that has CSV values in the database. So I will use the
DataImportHandler and will index it using RegexTransformer's splitBy
attribute. However, since this is the first time I am doing it, I just
wanted to be sure if it will work for Facet Count?

For example:
From "query" results (say this is the values in that field):
row 1 = 1,2,3,4
row 2 = 1,4,5,3
row 3 = 2,1,20,66
.
.
.
.
so facet count will get me:
'1' = 3 occurrence
'2' = 2 occur.
.
.
.and so on.





--
Regards,
Raheel Hasan





--
Regards,
Raheel Hasan 



CRLF Invalid Exception ?

2013-09-06 Thread Brent Ryan
Has anyone ever hit this when adding documents to SOLR?  What does it mean?


ERROR [http-8983-6] 2013-09-06 10:09:32,700 SolrException.java (line 108)
org.apache.solr.common.SolrException: Invalid CRLF

at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:175)

at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)

at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)

at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)

at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:663)

at
com.datastax.bdp.cassandra.index.solr.CassandraDispatchFilter.execute(CassandraDispatchFilter.java:176)

at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)

at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)

at
com.datastax.bdp.cassandra.index.solr.CassandraDispatchFilter.doFilter(CassandraDispatchFilter.java:139)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

at
com.datastax.bdp.cassandra.audit.SolrHttpAuditLogFilter.doFilter(SolrHttpAuditLogFilter.java:194)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

at
com.datastax.bdp.cassandra.index.solr.auth.CassandraAuthorizationFilter.doFilter(CassandraAuthorizationFilter.java:95)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

at
com.datastax.bdp.cassandra.index.solr.auth.DseAuthenticationFilter.doFilter(DseAuthenticationFilter.java:102)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)

at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)

at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)

at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)

at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)

at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)

at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)

at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)

at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)

at java.lang.Thread.run(Thread.java:722)

Caused by: com.ctc.wstx.exc.WstxIOException: Invalid CRLF

at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)

at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)

at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:387)

at
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)

at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)

... 30 more

Caused by: java.io.IOException: Invalid CRLF

at
org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)

at
org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:151)

at
org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:710)

at org.apache.coyote.Request.doRead(Request.java:428)

at
org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:304)

at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:403)

at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:327)

at
org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:162)

at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)

at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)

at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)

at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)

at
com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)

at
com.ctc.wstx.sr.StreamScanner.loadMoreFromCurrent(StreamScanner.java:1046)

at com.ctc.wstx.sr.StreamScanner.parseLocalName2(StreamScanner.java:1796)

at com.ctc.wstx.sr.StreamScanner.parseLocalName(StreamScanner.java:1756)

at
com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2914)

at
com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848)

at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)

... 33 more


Re: Regarding improving performance of the solr

2013-09-06 Thread Shawn Heisey
On 9/6/2013 2:54 AM, prabu palanisamy wrote:
> I am currently using solr -3.5.0,  indexed  wikipedia dump (50 gb) with
> java 1.6.
> I am searching the solr with text (which is actually twitter tweets) .
> Currently it takes average time of 210 millisecond for each post, out of
> which 200 millisecond is consumed by solr server (QTime).  I used the
> jconsole monitor tool.

If the size of all your Solr indexes on disk is in the 50GB range of
your wikipedia dump, then for ideal performance, you'll want to have
50GB of free memory so the OS can cache your index.  You might be able
to get by with 25-30GB of free memory, depending on your index composition.

Note that this is memory over and above what you allocate to the Solr
JVM, and memory used by other processes on the machine.  If you do have
other services on the same machine, note that those programs might ALSO
require OS disk cache RAM.

http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache

Thanks,
Shawn



Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
ok, i have html pages with body-comments around the content i
want. i want to extract (index, store) only what is
between the body-comments. i thought regexTransformer would be the best because
xpath doesn't work in tika and i can't nest a xpathEntityProcessor to use xpath.
what i have also found out is that the htmlparser from tika cuts my
body-comments out and tries to make well-formed html, which i would like to
switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:

> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>> i've managed to get it working if i use the regexTransformer and string is 
>> on the same line in my tika entity. but when the string is multilined it 
>> isn't working even though i tried ?s to set the flag dotall.
>> 
>> > dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" 
>> transformer="RegexTransformer">
>>  > replaceWith="QQQ" sourceColName="text"  />
>> 
>>  
>> then i tried it like this and i get a stackoverflow
>> 
>> > replaceWith="QQQ" sourceColName="text"  />
>> 
>> in javascript this works but maybe because i only used a small string.
> 
> Sounds like we've got an XY problem here.
> 
> http://people.apache.org/~hossman/#xyproblem
> 
> How about you tell us *exactly* what you'd actually like to have happen
> and then we can find a solution for you?
> 
> It sounds a little bit like you're interested in stripping all the HTML
> tags out.  Perhaps the HTMLStripCharFilter?
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> 
> Something that I already said: By using the KeywordTokenizer, you won't
> be able to search for individual words on your HTML input.  The entire
> input string is treated as a single token, and therefore ONLY exact
> entire-field matches (or certain wildcard matches) will be possible.
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
> 
> Note that no matter what you do to your data with the analysis chain,
> Solr will always return the text that was originally indexed in search
> results.  If you need to affect what gets stored as well, perhaps you
> need an Update Processor.
> 
> Thanks,
> Shawn



Re: Store 2 dimensional array( of int values) in solr 4.0

2013-09-06 Thread Jack Krupansky

You still haven't supplied any queries.

If all you really need is the JSON as a blob, simply store it as a string 
and parse the JSON in your application layer.


-- Jack Krupansky
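[A sketch of the string-blob approach, assuming dataX is declared as a stored string field: the JSON text is stored verbatim in Solr and parsed back into a 2D int array in the application layer. Plain Python; the field name and values are taken from the message.]

```python
import json

# What would be sent to Solr: the whole 2D array serialized as one string.
data_x = [[20130614, 2], [20130615, 11], [20130616, 1]]
stored_value = json.dumps(data_x)

# What the application does with the string returned in the Solr response:
parsed = json.loads(stored_value)

print(parsed[1])           # [20130615, 11]
print(type(parsed[1][0]))  # <class 'int'> - inner values come back as ints
```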

-Original Message- 
From: A Geek

Sent: Friday, September 06, 2013 10:30 AM
To: solr user
Subject: RE: Store 2 dimensional array( of int values) in solr 4.0

Hi, Thanks for the quick reply. Sure, please find the details below as per
your query.
Essentially, I want to retrieve the doc through JSON [using JSON format as
the SOLR result output] and want the dataX field to come back
as a two-dimensional array of ints. When I store the data as shown below, it
shows up as a JSON array of strings, where each inner array is basically
rendered as a string (because that's how the field is configured and stored;
I'm not finding any other option). Following is the current JSON output that I'm
able to fetch:
"dataX":["[20130614, 2]","[20130615, 11]","[20130616, 1]","[20130617, 
1]","[20130619, 8]","[20130620, 5]","[20130623, 5]"]

whereas I want  to fetch the dataX as something like:
"dataX":[[20130614, 2],[20130615, 11],[20130616, 1],[20130617, 1],[20130619, 
8],[20130620, 5],[20130623, 5]]
as can be seen, dataX is essentially a 2D array where each inner array holds
two ints, one being a date and the other a count.

Please point me in the right direction. Appreciate your time.
Thanks.


From: j...@basetechnology.com
To: solr-user@lucene.apache.org
Subject: Re: Store 2 dimensional array( of int values) in solr 4.0
Date: Fri, 6 Sep 2013 08:44:06 -0400

First you need to tell us how you wish to use and query the data. That 
will
largely determine how the data must be stored. Give us a few example 
queries

of how you would like your application to be able to access the data.

Note that Lucene has only simple multivalued fields - no structure or
nesting within a single field other that a list of scalar values.

But you can always store a complex structure as a BSON blob or JSON string
if all you want is to store and retrieve it in its entirety without 
querying
its internal structure. And note that Lucene queries are field level - 
does

a field contain or match a scalar value.

-- Jack Krupansky

-Original Message- 
From: A Geek

Sent: Friday, September 06, 2013 7:10 AM
To: solr user
Subject: Store 2 dimensional array( of int values) in solr 4.0

hi All, I'm trying to store a 2 dimensional array in SOLR [version 4.0].
Basically I've the following data:
[[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ...

The inner array is used to keep some count, say X, for that particular day.

Currently, I'm using the following field to store this data:

and I'm using python library pySolr to store the data. Currently the data
that gets stored looks like this(its array of strings)
[20121108, 1][20121110,
7][2012, 2][20121112, 2][20121113,
2][20121116, 1]
Is there a way I can store the 2-dimensional array such that the inner array
can contain int values, like the one shown in the beginning example, so that
the final/stored data in SOLR looks something like:
20121108  7
 20121110 12
 20121110 12

Just a guess: I think for this case we need to add one more field [the index
for instance] for each inner array, which will again be multivalued (and
will store int values only)? How do I add the actual 2-dimensional array,
how do I pass the inner arrays, and how do I store the full doc that contains
this 2-dimensional array? Please help me sort out this issue.
Please share your views and point me in the right direction. Any help 
would

be highly appreciated.
I found similar things on the web, but not the one I'm looking for:
http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html
Thanks






Re: charfilter doesn't do anything

2013-09-06 Thread Shawn Heisey
On 9/6/2013 7:09 AM, Andreas Owen wrote:
> i've managed to get it working if i use the regexTransformer and string is on 
> the same line in my tika entity. but when the string is multilined it isn't 
> working even though i tried ?s to set the flag dotall.
> 
>  dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" 
> transformer="RegexTransformer">
>replaceWith="QQQ" sourceColName="text"  />
> 
>   
> then i tried it like this and i get a stackoverflow
> 
>  replaceWith="QQQ" sourceColName="text"  />
> 
> in javascript this works but maybe because i only used a small string.

Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out.  Perhaps the HTMLStripCharFilter?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Something that I already said: By using the KeywordTokenizer, you won't
be able to search for individual words on your HTML input.  The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

Note that no matter what you do to your data with the analysis chain,
Solr will always return the text that was originally indexed in search
results.  If you need to affect what gets stored as well, perhaps you
need an Update Processor.

Thanks,
Shawn
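[For reference, a minimal analysis chain using HTMLStripCharFilter; this is a generic sketch, not taken from the poster's schema. The char filter removes tags before tokenization, so individual words inside the HTML remain searchable:]

```xml
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip HTML/XML tags before the tokenizer sees the text -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```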



Re: Facet Count and RegexTransformer>splitBy

2013-09-06 Thread Jack Krupansky
Facet counts are per field - your counts are scattered across different 
fields.


There are additional capabilities in the facet component, but first you 
should describe exactly what your requirements are.


-- Jack Krupansky
-Original Message- 
From: Raheel Hasan

Sent: Friday, September 06, 2013 9:58 AM
To: solr-user@lucene.apache.org
Subject: Facet Count and RegexTransformer>splitBy

Hi guyz,

Just a quick question:

I have a field that has CSV values in the database. So I will use the
DataImportHandler and will index it using RegexTransformer's splitBy
attribute. However, since this is the first time I am doing it, I just
wanted to be sure if it will work for Facet Count?

For example:

From "query" results (say this is the values in that field):

row 1 = 1,2,3,4
row 2 = 1,4,5,3
row 3 = 2,1,20,66
.
.
.
.
so facet count will get me:
'1' = 3 occurrence
'2' = 2 occur.
.
.
.and so on.





--
Regards,
Raheel Hasan 



Re: Solr 4.3 Startup with Multiple Cores Hangs on "Registering Core"

2013-09-06 Thread Erick Erickson
bq: I'm actually not using the transaction log (or the
NRTCachingDirectoryFactory); it's currently set up to use the
MMapDirectoryFactory,

This isn't relevant to whether you're using the update log or not, this is
just how the index is handled. Look for something in your solrconfig.xml
like:

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

The other thing to check is if you have files in a "tlog" directory that's
a sibling to your index directory as Hoss suggested.

You may well NOT have any transaction log, but it's something to check.



Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Erick Erickson
Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216


On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
wrote:

> Hi Mark,
>
> Got an issue to watch?
>
> Thanks,
> Markus
>
> -Original message-
> > From:Mark Miller 
> > Sent: Wednesday 4th September 2013 16:55
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud 4.x hangs under high update volume
> >
> > I'm going to try and fix the root cause for 4.5 - I've suspected what it
> is since early this year, but it's never personally been an issue, so it's
> rolled along for a long time.
> >
> > Mark
> >
> > Sent from my iPhone
> >
> > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt 
> wrote:
> >
> > > Hey guys,
> > >
> > > I am looking into an issue we've been having with SolrCloud since the
> > > beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
> 4.4.0
> > > yet). I've noticed other users with this same issue, so I'd really
> like to
> > > get to the bottom of it.
> > >
> > > Under a very, very high rate of updates (2000+/sec), after 1-12 hours
> we
> > > see stalled transactions that snowball to consume all Jetty threads in
> the
> > > JVM. This eventually causes the JVM to hang with most threads waiting
> on
> > > the condition/stack provided at the bottom of this message. At this
> point
> > > SolrCloud instances then start to see their neighbors (who also have
> all
> > > threads hung) as down w/"Connection Refused", and the shards become
> "down"
> > > in state. Sometimes a node or two survives and just returns 503s "no
> server
> > > hosting shard" errors.
> > >
> > > As a workaround/experiment, we have tuned the number of threads sending
> > > updates to Solr, as well as the batch size (we batch updates from
> client ->
> > > solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> > > Client-to-Solr batching (1 update = 1 call to Solr), which also did not
> > > help. Certain combinations of update threads and batch sizes seem to
> > > mask/help the problem, but not resolve it entirely.
> > >
> > > Our current environment is the following:
> > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
> and
> > > a replica of 1 shard).
> > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
> good
> > > day.
> > > - 5000 max jetty threads (well above what we use when we are healthy),
> > > Linux-user threads ulimit is 6000.
> > > - Occurs under Jetty 8 or 9 (many versions).
> > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > - Occurs under several JVM tunings.
> > > - Everything seems to point to Solr itself, and not a Jetty or Java
> version
> > > (I hope I'm wrong).
> > >
> > > The stack trace that is holding up all my Jetty QTP threads is the
> > > following, which seems to be waiting on a lock that I would very much
> like
> > > to understand further:
> > >
> > > "java.lang.Thread.State: WAITING (parking)
> > >at sun.misc.Unsafe.park(Native Method)
> > >- parking to wait for  <0x0007216e68d8> (a
> > > java.util.concurrent.Semaphore$NonfairSync)
> > >at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > >at
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> > >at
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> > >at
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> > >at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> > >at
> > >
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> > >at
> > >
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> > >at
> > >
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> > >at
> > >
> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> > >at
> > >
> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> > >at
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> > >at
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> > >at
> > >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> > >at
> > >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > >at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> > >at
> > >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> > >at
> > >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> > >at
> >

Re: Tweaking boosts for more search results variety

2013-09-06 Thread Sai Gadde
Thank you Jack for the suggestion.

We can try grouping by site. But considering that the number of sites is only
about 1000 against an index size of 5 million, one can expect most of the
hits would be hidden, and for certain specific keywords only a handful of
actual results could be displayed if results are grouped by site.

We already group on a signature field to identify duplicate content in
these 5 million+ docs. But there the number of duplicates is only about
3-5% maximum.

Is there any workaround for these limitations with grouping?

Thanks
Shyam



On Thu, Sep 5, 2013 at 9:16 PM, Jack Krupansky wrote:

> The grouping (field collapsing) feature somewhat addresses this - group by
> a "site" field and then if more than one or a few top pages are from the
> same site they get grouped or collapsed so that you can see more sites in a
> few results.
>
> See:
> http://wiki.apache.org/solr/FieldCollapsing
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>
> -- Jack Krupansky
>
> -Original Message- From: Sai Gadde
> Sent: Thursday, September 05, 2013 2:27 AM
> To: solr-user@lucene.apache.org
> Subject: Tweaking boosts for more search results variety
>
>
> Our index is aggregated content from various sites on the web. We want good
> user experience by showing multiple sites in the search results. In our
> setup we are seeing most of the results from same site on the top.
>
> Here is some information regarding queries and schema
>site - String field. We have about 1000 sites in index
>sitetype - String field.  we have 3 site types
> omitNorms="true" for both the fields
>
> Doc count varies largely based on site and sitetype by a factor of 10 -
> 1000 times
> Total index size is about 5 million docs.
> Solr Version: 4.0
>
> In our queries we have a fixed and preferential boost for certain sites.
> sitetype has different and fixed boosts for 3 possible values. We turned
> off Inverse Document Frequency (IDF) for these boosts to work properly.
> Other text fields are boosted based on search keywords only.
>
> With this setup we often see a bunch of hits from a single site followed by
> next etc.,
> Is there any solution to see results from variety of sites and still keep
> the preferential boosts in place?
>


Re: Invalid Version when slave node pull replication from master node

2013-09-06 Thread Erick Erickson
Whoa! You should _not_ be using replication with SolrCloud. You can use
replication just fine with 4.4, just like you would have in 3.x say, but in
that case you should not be using the zkHost or zkRun parameters, should not
have a ZooKeeper ensemble running etc.

In SolrCloud, all updates are routed to all the nodes at index time,
otherwise
it couldn't support, say, NRT processing. This makes replication not only
unnecessary, but I wouldn't want to try to predict what problems that would
cause.

So keep a sharp distinction between running Solr 4x and SolrCloud. The
latter
is specifically enabled when you specify zkHost or zkRun when you start Solr
as per the SolrCloud page.

Best
Erick


On Wed, Sep 4, 2013 at 11:32 PM, YouPeng Yang wrote:

> Hi all
>I solved the problem by adding the coreName explicitly according to
> http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml.
>
>But I want to make sure: is it necessary to set the coreName
> explicitly? And is there any SolrJ API to trigger the replication pull on the
> slave node from the master node?
>
>
> regards
>
>
>
> 2013/9/5 YouPeng Yang 
>
> > Hi again
> >
> >   I'm  using Solr4.4.
> >
> >
> > 2013/9/5 YouPeng Yang 
> >
> >> HI solrusers
> >>
> >>I'm testing the replication within SolrCloud .
> >>I just uncomment the replication section separately on the master and
> >> slave node.
> >>The replication section setting on the  master node:
> >> 
> >> <requestHandler name="/replication" class="solr.ReplicationHandler">
> >>   <lst name="master">
> >>     <str name="replicateAfter">commit</str>
> >>     <str name="replicateAfter">startup</str>
> >>     <str name="confFiles">schema.xml,stopwords.txt</str>
> >>   </lst>
> >> </requestHandler>
> >>
> >>  and on the slave node:
> >> <requestHandler name="/replication" class="solr.ReplicationHandler">
> >>   <lst name="slave">
> >>     <str name="masterUrl">http://10.7.23.124:8080/solr/#/</str>
> >>     <str name="pollInterval">00:00:50</str>
> >>   </lst>
> >> </requestHandler>
> >>
> >>After startup, an Error comes out on the slave node :
> >> 80110110 [snapPuller-70-thread-1] ERROR
> >> org.apache.solr.handler.SnapPuller  ?.Master at:
> >> http://10.7.23.124:8080/solr/#/ is not available. Index fetch failed.
> >> Exception: Invalid version (expected 2, but 60) or the data in not in
> >> 'javabin' format
> >>
> >>
> >>  Could anyone help me to solve the problem ?
> >>
> >>
> >> regards
> >>
> >>
> >>
> >>
> >
>
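
For reference, this particular "Invalid version ... 'javabin' format" error
usually means the slave's masterUrl points at an HTML page (here the admin UI,
/solr/#/) rather than a core's replication endpoint, so the response cannot be
parsed as javabin. A corrected slave section would point at the core itself
(the core name below is hypothetical):

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://10.7.23.124:8080/solr/mycore/replication</str>
    <str name="pollInterval">00:00:50</str>
  </lst>
</requestHandler>
```

This matches the poster's own resolution of adding the core name explicitly.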


RE: Store 2 dimensional array( of int values) in solr 4.0

2013-09-06 Thread A Geek
Hi, thanks for the quick reply. Sure, please find below the details as per your 
query.
Essentially, I want to retrieve the doc through JSON [using JSON format as the SOLR 
result output] and want JSON to present the data from the dataX field as a two-
dimensional array of ints. When I store the data as shown below, it shows up as a 
JSON array of strings, where each inner array is rendered as a string 
(because that's how the field is configured; I haven't found any 
other option). Following is the current JSON output that I'm able to fetch: 
"dataX":["[20130614, 2]","[20130615, 11]","[20130616, 1]","[20130617, 
1]","[20130619, 8]","[20130620, 5]","[20130623, 5]"]
whereas I want  to fetch the dataX as something like: 
"dataX":[[20130614, 2],[20130615, 11],[20130616, 1],[20130617, 1],[20130619, 
8],[20130620, 5],[20130623, 5]]
as can be seen, dataX is essentially a 2D array where each inner array holds 
two ints, one being the date and the other being the count.
Please point me in the right direction. Appreciate your time.
Thanks.

> From: j...@basetechnology.com
> To: solr-user@lucene.apache.org
> Subject: Re: Store 2 dimensional array( of int values) in solr 4.0
> Date: Fri, 6 Sep 2013 08:44:06 -0400
> 
> First you need to tell us how you wish to use and query the data. That will 
> largely determine how the data must be stored. Give us a few example queries 
> of how you would like your application to be able to access the data.
> 
> Note that Lucene has only simple multivalued fields - no structure or 
> nesting within a single field other than a list of scalar values.
> 
> But you can always store a complex structure as a BSON blob or JSON string 
> if all you want is to store and retrieve it in its entirety without querying 
> its internal structure. And note that Lucene queries are field level - does 
> a field contain or match a scalar value.
> 
> -- Jack Krupansky
> 
> -Original Message- 
> From: A Geek
> Sent: Friday, September 06, 2013 7:10 AM
> To: solr user
> Subject: Store 2 dimensional array( of int values) in solr 4.0
> 
> hi All, I'm trying to store a 2 dimensional array in SOLR [version 4.0]. 
> Basically I've the following data:
> [[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ...
> 
> The inner array being used to keep some count say X for that particular day. 
> Currently, I'm using the following field to store this data:
>  multiValued="true"/>
> and I'm using python library pySolr to store the data. Currently the data 
> that gets stored looks like this(its array of strings)
> [20121108, 1][20121110, 
> 7][2012, 2][20121112, 2][20121113, 
> 2][20121116, 1]
> Is there a way, i can store the 2 dimensional array and the inner array can 
> contain int values, like the one shown in the beginning example, such that 
> the final/stored data in SOLR looks something like: 
> 20121108  7  
>  20121110 12 
>  20121110 12 
> 
> Just a guess, I think for this case, we need to add one more field[the index 
> for instance], for each inner array which will again be multivalued (which 
> will store int values only)? How do I add the actual 2 dimensional array, 
> how to pass the inner arrays and how to store the full doc that contains 
> this 2-dimensional array. Please help me sort out this issue.
> Please share your views and point me in the right direction. Any help would 
> be highly appreciated.
> I found similar things on the web, but not the one I'm looking for: 
> http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html
> Thanks 
> 
  

Re: Solr substring search

2013-09-06 Thread Erick Erickson
Yah, you're getting away with it due to the small data size. As
your data grows, the underlying mechanisms have to enumerate
every term in the field in order to find terms that match so it
can get _very_ expensive with large data sets.

Best to bite the bullet early or, better yet, see if you really need
to support this use-case.

Best,
Erick


On Fri, Sep 6, 2013 at 2:58 AM, Alvaro Cabrerizo  wrote:

> Hi:
>
> I would start looking:
>
> http://docs.lucidworks.com/display/solr/The+Standard+Query+Parser
>
> And the
> org.apache.lucene.queryparser.flexible.standard.StandardQueryParser.java
>
> Hope it helps.
>
> On Thu, Sep 5, 2013 at 11:30 PM, Scott Schneider <
> scott_schnei...@symantec.com> wrote:
>
> > Hello,
> >
> > I'm trying to find out how Solr runs a query for "*foo*".  Google tells
> me
> > that you need to use NGramFilterFactory for that kind of substring
> search,
> > but I find that even with very simple fieldTypes, it just works.
>  (Perhaps
> > because I'm testing on very small data sets, Solr is willing to look
> > through all the keywords.)  e.g. This works on the tutorial.
> >
> > Can someone tell me exactly how this works and/or point me to the Lucene
> > code that implements this?
> >
> > Thanks,
> > Scott
> >
> >
>


Re: solrcloud shards backup/restoration

2013-09-06 Thread Mark Miller
Phone typing. The end should not say "don't hard commit" - it should say "do a 
hard commit and take a snapshot". 

Mark

Sent from my iPhone

On Sep 6, 2013, at 7:26 AM, Mark Miller  wrote:

> I don't know that it's too bad though - its always been the case that if you 
> do a backup while indexing, it's just going to get up to the last hard 
> commit. With SolrCloud that will still be the case. So just make sure you do 
> a hard commit right before taking the backup - yes, it might miss a few docs 
> in the tran log, but if you are taking a back up while indexing, you don't 
> have great precision in any case - you will roughly get a snapshot for around 
> that time - even without SolrCloud, if you are worried about precision and 
> getting every update into that backup, you want to stop indexing and commit 
> first. But if you just want a rough snapshot for around that time, in both 
> cases you can still just don't hard commit and take a snapshot. 
> 
> Mark
> 
> Sent from my iPhone
> 
> On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar  
> wrote:
> 
>> The replication handler's backup command was built for pre-SolrCloud.
>> It takes a snapshot of the index but it is unaware of the transaction
>> log which is a key component in SolrCloud. Hence unless you stop
>> updates, commit your changes and then take a backup, you will likely
>> miss some updates.
>> 
>> That being said, I'm curious to see how peer sync behaves when you try
>> to restore from a snapshot. When you say that you haven't been
>> successful in restoring, what exactly is the behaviour you observed?
>> 
>> On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja  
>> wrote:
>>> Hello,
>>> 
>>> I was looking for a good backup / recovery solution for the solrcloud
>>> indexes. I am more looking for restoring the indexes from the index
>>> snapshot, which can be taken using the replicationHandler's backup command.
>>> 
>>> I am looking for something that works with solrcloud 4.3 eventually, but
>>> still relevant if you tested with a previous version.
>>> 
>>> I haven't been successful in having the restored index replicate across the
>>> new replicas, after I restart all the nodes, with one node having the
>>> restored index.
>>> 
>>> Is restoring the indexes on all the nodes the best way to do it ?
>>> --
>>> Regards,
>>> -Aditya Sakhuja
>> 
>> 
>> 
>> -- 
>> Regards,
>> Shalin Shekhar Mangar.


Re: solrcloud shards backup/restoration

2013-09-06 Thread Mark Miller
I don't know that it's too bad though - its always been the case that if you do 
a backup while indexing, it's just going to get up to the last hard commit. 
With SolrCloud that will still be the case. So just make sure you do a hard 
commit right before taking the backup - yes, it might miss a few docs in the 
tran log, but if you are taking a back up while indexing, you don't have great 
precision in any case - you will roughly get a snapshot for around that time - 
even without SolrCloud, if you are worried about precision and getting every 
update into that backup, you want to stop indexing and commit first. But if you 
just want a rough snapshot for around that time, in both cases you can still 
just don't hard commit and take a snapshot. 

Mark

Sent from my iPhone

On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar  
wrote:

> The replication handler's backup command was built for pre-SolrCloud.
> It takes a snapshot of the index but it is unaware of the transaction
> log which is a key component in SolrCloud. Hence unless you stop
> updates, commit your changes and then take a backup, you will likely
> miss some updates.
> 
> That being said, I'm curious to see how peer sync behaves when you try
> to restore from a snapshot. When you say that you haven't been
> successful in restoring, what exactly is the behaviour you observed?
> 
> On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja  
> wrote:
>> Hello,
>> 
>> I was looking for a good backup / recovery solution for the solrcloud
>> indexes. I am more looking for restoring the indexes from the index
>> snapshot, which can be taken using the replicationHandler's backup command.
>> 
>> I am looking for something that works with solrcloud 4.3 eventually, but
>> still relevant if you tested with a previous version.
>> 
>> I haven't been successful in having the restored index replicate across the
>> new replicas, after I restart all the nodes, with one node having the
>> restored index.
>> 
>> Is restoring the indexes on all the nodes the best way to do it ?
>> --
>> Regards,
>> -Aditya Sakhuja
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
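
The hard-commit-then-snapshot sequence Mark describes can be issued against
each core over HTTP (host, core name, and backup location below are
placeholders):

```
# 1. Hard commit so the index files reflect everything indexed so far
http://localhost:8983/solr/core1/update?commit=true

# 2. Take the snapshot via the replication handler
http://localhost:8983/solr/core1/replication?command=backup&location=/backups/core1
```

Documents still sitting only in the transaction log at step 1 are flushed to
the index by the commit, which is why the commit has to come first.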


RE: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Markus Jelsma
Thanks!
 
-Original message-
> From:Erick Erickson 
> Sent: Friday 6th September 2013 16:20
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud 4.x hangs under high update volume
> 
> Markus:
> 
> See: https://issues.apache.org/jira/browse/SOLR-5216
> 
> 
> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> wrote:
> 
> > Hi Mark,
> >
> > Got an issue to watch?
> >
> > Thanks,
> > Markus
> >
> > -Original message-
> > > From:Mark Miller 
> > > Sent: Wednesday 4th September 2013 16:55
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > >
> > > I'm going to try and fix the root cause for 4.5 - I've suspected what it
> > is since early this year, but it's never personally been an issue, so it's
> > rolled along for a long time.
> > >
> > > Mark
> > >
> > > Sent from my iPhone
> > >
> > > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt 
> > wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I am looking into an issue we've been having with SolrCloud since the
> > > > beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
> > 4.4.0
> > > > yet). I've noticed other users with this same issue, so I'd really
> > like to
> > > > get to the bottom of it.
> > > >
> > > > Under a very, very high rate of updates (2000+/sec), after 1-12 hours
> > we
> > > > see stalled transactions that snowball to consume all Jetty threads in
> > the
> > > > JVM. This eventually causes the JVM to hang with most threads waiting
> > on
> > > > the condition/stack provided at the bottom of this message. At this
> > point
> > > > SolrCloud instances then start to see their neighbors (who also have
> > all
> > > > threads hung) as down w/"Connection Refused", and the shards become
> > "down"
> > > > in state. Sometimes a node or two survives and just returns 503s "no
> > server
> > > > hosting shard" errors.
> > > >
> > > > As a workaround/experiment, we have tuned the number of threads sending
> > > > updates to Solr, as well as the batch size (we batch updates from
> > client ->
> > > > solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> > > > Client-to-Solr batching (1 update = 1 call to Solr), which also did not
> > > > help. Certain combinations of update threads and batch sizes seem to
> > > > mask/help the problem, but not resolve it entirely.
> > > >
> > > > Our current environment is the following:
> > > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
> > and
> > > > a replica of 1 shard).
> > > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
> > good
> > > > day.
> > > > - 5000 max jetty threads (well above what we use when we are healthy),
> > > > Linux-user threads ulimit is 6000.
> > > > - Occurs under Jetty 8 or 9 (many versions).
> > > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > > - Occurs under several JVM tunings.
> > > > - Everything seems to point to Solr itself, and not a Jetty or Java
> > version
> > > > (I hope I'm wrong).
> > > >
> > > > The stack trace that is holding up all my Jetty QTP threads is the
> > > > following, which seems to be waiting on a lock that I would very much
> > like
> > > > to understand further:
> > > >
> > > > "java.lang.Thread.State: WAITING (parking)
> > > >at sun.misc.Unsafe.park(Native Method)
> > > >- parking to wait for  <0x0007216e68d8> (a
> > > > java.util.concurrent.Semaphore$NonfairSync)
> > > >at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > >at
> > > >
> > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> > > >at
> > > >
> > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> > > >at
> > > >
> > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> > > >at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> > > >at
> > > >
> > org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> > > >at
> > > >
> > org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> > > >at
> > > >
> > org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> > > >at
> > > >
> > org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> > > >at
> > > >
> > org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> > > >at
> > > >
> > org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> > > >at
> > > >
> > org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> > > >at
> > > >
> > org.apache.solr.handler.ContentStreamHandl

RE: Solr 4.3 Startup with Multiple Cores Hangs on "Registering Core"

2013-09-06 Thread Austin Rasmussen
Thanks for clearing that up Erick.  The updateLog XML element isn't present in 
any of the solrconfig.xml files, so I don't believe this is enabled.  

I posted the directory listing of all of the core data directories in a prior 
post, but there are no files/folders found that contain "tlog" in the name of 
them.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, September 06, 2013 9:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.3 Startup with Multiple Cores Hangs on "Registering Core"

bq: I'm actually not using the transaction log (or the 
NRTCachingDirectoryFactory); it's currently set up to use the 
MMapDirectoryFactory,

This isn't relevant to whether you're using the update log or not, this is just 
how the index is handled. Look for something in your solrconfig.xml
like:

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>


The other thing to check is if you have files in a "tlog" directory that's a 
sibling to your index directory as Hoss suggested.

You may well NOT have any transaction log, but it's something to check.



Facet Count and RegexTransformer>splitBy

2013-09-06 Thread Raheel Hasan
Hi guyz,

Just a quick question:

I have a field that has CSV values in the database. So I will use the
DataImportHandler and will index it using RegexTransformer's splitBy
attribute. However, since this is the first time I am doing it, I just
wanted to be sure if it will work for Facet Count?

For example:
From "query" results (say these are the values in that field):
row 1 = 1,2,3,4
row 2 = 1,4,5,3
row 3 = 2,1,20,66
.
.
.
.
so facet count will get me:
'1' = 3 occurrences
'2' = 2 occurrences
.
.
.and so on.





-- 
Regards,
Raheel Hasan
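
To confirm the approach: splitBy turns the CSV column into multiple values of a
multiValued field, and faceting then counts each value separately. A sketch of
the DIH entity (table, column, and field names are hypothetical):

```xml
<entity name="item" transformer="RegexTransformer"
        query="SELECT id, tags_csv FROM items">
  <field column="tags" sourceColName="tags_csv" splitBy=","/>
</entity>
```

With `tags` declared multiValued in schema.xml, a request such as
`q=*:*&facet=true&facet.field=tags` should return exactly the per-value counts
described above.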


Re: Solr Cell Question

2013-09-06 Thread Erick Erickson
It's always frustrating when someone replies with "Why not do it
a completely different way?".  But I will anyway :).

There's no requirement at all that you send things to Solr to make
Solr Cel (aka Tika) do it's tricks. Since you're already in SolrJ
anyway, why not just parse on the client? This has the advantage
of allowing you to offload the Tika processing from Solr which can
be quite expensive. You can use the same Tika jars that come
with Solr or download whatever version from the Tika project
you want. That way, you can exercise much better control over
what's done.

Here's a skeletal program with indexing from a DB mixed in, but
it shouldn't be hard at all to pull the DB parts out.

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

FWIW,
Erick


On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson  wrote:

> Is it possible to configure solr cell to only extract and store the body of
> a document when indexing?  I'm currently doing the following which I
> thought would work
>
> ModifiableSolrParams params = new ModifiableSolrParams();
>
>  params.set("defaultField", "content");
>
>  params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");
>
>  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest(
> "/update/extract");
>
>  up.setParams(params);
>
>  ContentStreamBase.FileStream f = new ContentStreamBase.FileStream(new File(".."));
>
>  up.addContentStream(f);
>
> up.setAction(ACTION.COMMIT, true, true);
>
> solrServer.request(up);
>
>
> But the result of content is as follows
>
> 
> 
> null
> ISO-8859-1
> text/plain; charset=ISO-8859-1
> Just a little test
> 
>
>
> What I had hoped for was just
>
> 
> Just a little test
> 
>


RE: Regarding improving performance of the solr

2013-09-06 Thread Jean-Sebastien Vachon
Have you checked the hit ratio of the different caches? Try to tune them to get 
rid of all evictions if possible.

Tuning the size of the caches and warming you searcher can give you a pretty 
good improvement. You might want to check your analysis chain as well to see if 
you`re not doing anything that is not necessary.



> -Original Message-
> From: prabu palanisamy [mailto:pr...@serendio.com]
> Sent: September-06-13 4:55 AM
> To: solr-user@lucene.apache.org
> Subject: Regarding improving performance of the solr
> 
>  Hi
> 
> I am currently using solr -3.5.0,  indexed  wikipedia dump (50 gb) with java
> 1.6.
> I am searching Solr with text (which is actually Twitter tweets).
> Currently it takes an average of 210 milliseconds per post, of which
> 200 milliseconds are consumed by the Solr server (QTime).  I used the JConsole
> monitoring tool.
> 
> The stats are
>Heap usage - 10-50Mb,
>No of threads - 10-20
>No of class- 3800,
>Cpu usage - 10-15%
> 
> Currently I am loading all the fields of the Wikipedia dump.
> 
> I only need the freebase category and wikipedia category. I want to know
> how to optimize the solr server to improve the performance.
> 
> Could you please help me out in optimizing the performance?
> 
> Thanks and Regards
> Prabu
> 
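
The cache sizing Jean-Sebastien refers to lives in solrconfig.xml; a starting
point might look like the following (the sizes and autowarm counts are guesses
to be tuned against your actual hit-ratio and eviction statistics):

```xml
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
<documentCache class="solr.LRUCache" size="1024" initialSize="1024" autowarmCount="0"/>
```

The autowarmCount values control how many entries are copied into a new
searcher's caches on commit, which is what "warming the searcher" refers to.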


Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
I've managed to get it working if I use the RegexTransformer and the string is on 
the same line in my Tika entity. But when the string spans multiple lines it isn't 
working, even though I tried (?s) to set the DOTALL flag.





Then I tried it like this and I get a StackOverflowError:



In JavaScript this works, but maybe only because I used a small string.



On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:

> Is there any chance that you changed your schema since you indexed the data? 
> If so, re-index the data.
> 
> If a "*" query finds nothing, that implies that the default field is empty. 
> Are you sure the "df" parameter is set to the field containing your data? 
> Show us your request handler definition and a sample of your actual Solr 
> input (Solr XML or JSON?) so that we can see what fields are being populated.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Andreas Owen
> Sent: Friday, September 06, 2013 4:01 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> the input string is a normal html page with the word Zahlungsverkehr in it 
> and my query is ...solr/collection1/select?q=*
> 
> On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:
> 
>> And show us an input string and a query that fail.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Shawn Heisey
>> Sent: Thursday, September 05, 2013 2:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>>> i would like to filter / replace a word during indexing but it doesn't do 
>>> anything and I don't get an error.
>>> 
>>> in schema.xml i have the following:
>>> 
>>> >> multiValued="true"/>
>>> 
>>> 
>>> 
>>> 
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>> 
>>> 
>>>  
>>> 
>>> my 2. question is where can i say that the expression is multilined like in 
>>> javascript i can use /m at the end of the pattern?
>> 
>> I don't know about your second question.  I don't know if that will be
>> possible, but I'll leave that to someone who's more expert than I.
>> 
>> As for the first question, here's what I have.  Did you reindex?  That
>> will be required.
>> 
>> http://wiki.apache.org/solr/HowToReindex
>> 
>> Assuming that you did reindex, are you trying to search for ASDFGHJK in
>> a field that contains more than just "Zahlungsverkehr"?  The keyword
>> tokenizer might not do what you expect - it tokenizes the entire input
>> string as a single token, which means that you won't be able to search
>> for single words in a multi-word field without wildcards, which are
>> pretty slow.
>> 
>> Note that both the pattern and replacement are case sensitive.  This is
>> how regex works.  You haven't used a lowercase filter, which means that
>> you won't be able to search for asdfghjk.
>> 
>> Use the analysis tab in the UI on your core to see what Solr does to
>> your field text.
>> 
>> Thanks,
>> Shawn 
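
Putting Shawn's points together, an analysis chain that replaces the word,
tokenizes normally, and lowercases might look like this (the fieldType name is
hypothetical; reindexing is required after the change):

```xml
<fieldType name="text_replaced" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the standard tokenizer instead of the keyword tokenizer, individual words
become searchable without wildcards, and the lowercase filter lets a query for
`asdfghjk` match.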



Re: charfilter doesn't do anything

2013-09-06 Thread Jack Krupansky
Is there any chance that you changed your schema since you indexed the 
data? If so, re-index the data.


If a "*" query finds nothing, that implies that the default field is empty. 
Are you sure the "df" parameter is set to the field containing your data? 
Show us your request handler definition and a sample of your actual Solr 
input (Solr XML or JSON?) so that we can see what fields are being 
populated.


-- Jack Krupansky

-Original Message- 
From: Andreas Owen

Sent: Friday, September 06, 2013 4:01 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

the input string is a normal html page with the word Zahlungsverkehr in it 
and my query is ...solr/collection1/select?q=*


On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:


And show us an input string and a query that fail.

-- Jack Krupansky

-Original Message- From: Shawn Heisey
Sent: Thursday, September 05, 2013 2:41 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

On 9/5/2013 10:03 AM, Andreas Owen wrote:
i would like to filter / replace a word during indexing but it doesn't do 
anything and I don't get an error.


in schema.xml i have the following:

multiValued="true"/>




 
 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK" />

 

  

my 2. question is where can i say that the expression is multilined like 
in javascript i can use /m at the end of the pattern?


I don't know about your second question.  I don't know if that will be
possible, but I'll leave that to someone who's more expert than I.

As for the first question, here's what I have.  Did you reindex?  That
will be required.

http://wiki.apache.org/solr/HowToReindex

Assuming that you did reindex, are you trying to search for ASDFGHJK in
a field that contains more than just "Zahlungsverkehr"?  The keyword
tokenizer might not do what you expect - it tokenizes the entire input
string as a single token, which means that you won't be able to search
for single words in a multi-word field without wildcards, which are
pretty slow.

Note that both the pattern and replacement are case sensitive.  This is
how regex works.  You haven't used a lowercase filter, which means that
you won't be able to search for asdfghjk.

Use the analysis tab in the UI on your core to see what Solr does to
your field text.

Thanks,
Shawn 




Re: Restrict Parsing duplicate file in Solr

2013-09-06 Thread Jack Krupansky
Explain what you mean by restricting duplicate file indexing. Solr doesn't work 
at the "file" level - only documents (rows or records) and fields and 
values.


-- Jack Krupansky

-Original Message- 
From: shabbir

Sent: Friday, September 06, 2013 12:24 AM
To: solr-user@lucene.apache.org
Subject: Restrict Parsing duplicate file in Solr

Hi, I am new to Solr. I am looking for an option to restrict duplicate file
indexing in Solr. Please let me know if it can be done with any configuration
change.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Restrict-Parsing-duplicate-file-in-Solr-tp4088471.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Store 2 dimensional array( of int values) in solr 4.0

2013-09-06 Thread Jack Krupansky
First you need to tell us how you wish to use and query the data. That will 
largely determine how the data must be stored. Give us a few example queries 
of how you would like your application to be able to access the data.


Note that Lucene has only simple multivalued fields - no structure or 
nesting within a single field other than a list of scalar values.


But you can always store a complex structure as a BSON blob or JSON string 
if all you want is to store and retrieve it in its entirety without querying 
its internal structure. And note that Lucene queries are field level - does 
a field contain or match a scalar value.


-- Jack Krupansky

-Original Message- 
From: A Geek

Sent: Friday, September 06, 2013 7:10 AM
To: solr user
Subject: Store 2 dimensional array( of int values) in solr 4.0

hi All, I'm trying to store a 2 dimensional array in SOLR [version 4.0]. 
Basically I've the following data:

[[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ...

The inner array being used to keep some count say X for that particular day. 
Currently, I'm using the following field to store this data:
multiValued="true"/>
and I'm using python library pySolr to store the data. Currently the data 
that gets stored looks like this(its array of strings)
[20121108, 1][20121110, 
7][2012, 2][20121112, 2][20121113, 
2][20121116, 1]
Is there a way, i can store the 2 dimensional array and the inner array can 
contain int values, like the one shown in the beginning example, such that 
the final/stored data in SOLR looks something like: 

20121108  7  
 20121110 12 
 20121110 12 

Just a guess, I think for this case, we need to add one more field[the index 
for instance], for each inner array which will again be multivalued (which 
will store int values only)? How do I add the actual 2 dimensional array, 
how to pass the inner arrays and how to store the full doc that contains 
this 2-dimensional array. Please help me sort out this issue.
Please share your views and point me in the right direction. Any help would 
be highly appreciated.
I found similar things on the web, but not the one I'm looking for: 
http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html
Thanks 



SOLR 3.6.1 auto complete sorting

2013-09-06 Thread Poornima Jay
Hi, 

We had implemented Auto Complete feature in our site. Below are the solr config 
details.

schema.xml

 
         
            
            
            
            
            
         
         
            
            
            
            
         
      



 




 
   
   
   
  
solrquery is  
q=ph_su%3Aepub+&start=0&rows=10&fl=dams_id&wt=json&indent=on&hl=true&hl.fl=ph_su&hl.simple.pre=&hl.simple.post=

the requirement is to sort the results based on relevance and the latest published 
products for the search term.

I have tried the below parameters but nothing worked:

sort = dams_id desc,published_date desc
order_by = dams_id desc,published_date desc

Please let me know how to sort the results with relevance and published date 
descending.

Thanks,
Poornima
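
For mixing relevance with a field sort, Solr exposes the relevance score as the
pseudo-field `score`, so a sort clause along these lines (the published_date
field is taken from the message and assumed to be a sortable date type, with
spaces URL-encoded in practice) may be closer to the requirement:

```
q=ph_su:epub&sort=score desc,published_date desc&start=0&rows=10
```

Note that a secondary sort only applies when primary values tie; truly blending
recency into relevance usually means boosting by date instead, e.g. with a
boost function over published_date.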


Re: Solr documents update on index

2013-09-06 Thread Luís Portela Afonso
Hi,

But I'm indexing RSS feeds. I want Solr to index them without changing the 
existing information of a document with the same uniqueKey.
The best approach would be for Solr to update the doc only if changes are detected, but I 
can live without that.

I really would like Solr not to update the document if it already exists.

I'm using the DataImportScheduler so that Solr launches the scheduled imports.

Appreciate any possible help.

On Sep 6, 2013, at 9:16 AM, Shalin Shekhar Mangar  
wrote:

> Yes, if a document with the same key exists, then the old document
> will be deleted and replaced with the new document. You can also
> partially update documents (we call it atomic updates) which reads the
> old document from local index, updates it according to the request and
> then replaces the old document with the new one.
> 
> See 
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-UpdatingOnlyPartofaDocument
> 
> On Fri, Sep 6, 2013 at 1:03 AM, Luis Portela Afonso
>  wrote:
>> Hi,
>> 
>> I'm having a problem when solr indexes.
>> It is updating documents already indexed. Is this a normal behavior?
>> If a document with the same key already exists is it supposed to be updated?
>> I was thinking that it is supposed to update only if the information in the
>> RSS feed has changed.
>> 
>> Appreciate your help
>> 
>> --
>> Sent from Gmail Mobile
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.





Store 2 dimensional array( of int values) in solr 4.0

2013-09-06 Thread A Geek
hi All, I'm trying to store a 2-dimensional array in SOLR [version 4.0]. 
Basically I have the following data: 
[[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ...

The inner array is used to keep some count, say X, for that particular day. 
Currently, I'm using the following field to store this data: 

and I'm using the python library pySolr to store the data. Currently the data 
that gets stored looks like this (it's an array of strings):
[20121108, 1][20121110, 
7][2012, 2][20121112, 2][20121113, 
2][20121116, 1]
Is there a way I can store the 2-dimensional array with the inner arrays 
containing int values, like the example at the beginning, so that the 
final/stored data in SOLR looks something like: 
20121108  7  
 20121110 12 
 20121110 12 

Just a guess: I think for this case we need to add one more field (the index, 
for instance) for each inner array, which will again be multivalued (and will 
store int values only)? How do I add the actual 2-dimensional array, how do I 
pass the inner arrays, and how do I store the full doc that contains this 
2-dimensional array? Please help me sort out this issue.
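Since Solr 4.0 has no nested or two-dimensional field type, one common workaround is indeed to flatten the pairs into two parallel multiValued int fields and rely on positions lining up. A sketch of such a document in JSON update format (the `day_i`/`count_i` field names are made up for illustration):

```json
[
  {
    "id": "doc1",
    "day_i":   [20121108, 20121110, 20121112],
    "count_i": [1, 7, 2]
  }
]
```

The client zips the two lists back together after retrieval; Solr preserves the order of multiValued stored fields as given, so both lists must always be written in the same order.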
Please share your views and point me in the right direction. Any help would be 
highly appreciated. 
I found similar things on the web, but not the one I'm looking for: 
http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html
Thanks

Restrict Parsing duplicate file in Solr

2013-09-06 Thread shabbir
Hi, I am new to Solr. I am looking for an option to restrict duplicate file
indexing in Solr. Please let me know if it can be done with any configuration
change.
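For what it's worth, Solr ships a deduplication update processor that computes a content signature and can overwrite (or flag) near-identical documents. A minimal solrconfig.xml sketch, assuming a `signature` field exists in schema.xml and `content` holds the extracted file text:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain still has to be attached to your update handler (e.g. update.chain=dedupe) for it to take effect.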



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Restrict-Parsing-duplicate-file-in-Solr-tp4088471.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: bucket count for facets

2013-09-06 Thread Steven Bower
Understood, what I need is a count of the unique values in a field and that
field is multi-valued (which makes stats component a non-option)


On Fri, Sep 6, 2013 at 4:22 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Stats Component can give you a count of non-null values in a field.
>
> See https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
>
> On Fri, Sep 6, 2013 at 12:28 AM, Steven Bower 
> wrote:
> > Is there a way to get the count of buckets (ie unique values) for a field
> > facet? the rudimentary approach of course is to get back all buckets, but
> > in some cases this is a huge amount of data.
> >
> > thanks,
> >
> > steve
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: How to config SOLR server for spell check functionality

2013-09-06 Thread Shalin Shekhar Mangar
On Wed, Sep 4, 2013 at 4:56 PM, sebastian.manolescu
 wrote:
> I want to implement the spell check functionality offered by Solr using a MySql
> database, but I don't understand how.
> Here is the basic flow of what I want to do.
>
> I have a simple inputText (in jsf) and if I type the word shwo the response
> to OutputLabel should be show.
>
> First of all I'm using the following tools and frameworks:
>
> JBoss application server 6.1.
> Eclipse
> JPA
> JSF(Primefaces)
>
> Steps I've done until now:
>
> Step 1: Download solr server from:
> http://lucene.apache.org/solr/downloads.html Extract content.
>
> Step 2: Add an environment variable:
>
> Variable name: solr.solr.home Variable value:
> D:\JBOSS\solr-4.4.0\solr-4.4.0\example\solr --- where you have the solr
> server
>
> Step 3:
>
> Open the solr war and add an env-entry to solr.war\WEB-INF\web.xml (the easy way)
>
> solr/home D:\JBOSS\solr-4.4.0\solr-4.4.0\example\solr java.lang.String
>
> OR import the project, make the change, and build the war.
>
> Step 4: Browser: localhost:8080/solr/
>
> And the solr console appears.
>
> Until now all works well.
>
> I have found some useful code (in my opinion) that returns:
>
> [collection1] webapp=/solr path=/spell
> params={spellcheck=on&q=whatever&wt=javabin&qt=/spell&version=2&spellcheck.build=true}
> hits=0 status=0 QTime=16
>
> Here is the code that gives the result from above:
>
> SolrServer solr;
> try {
> solr = new CommonsHttpSolrServer("http://localhost:8080/solr");
>
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set("qt", "/spell");
> params.set("q", "whatever");
> params.set("spellcheck", "on");
> params.set("spellcheck.build", "true");
>
> QueryResponse response = solr.query(params);
> SpellCheckResponse spellCheckResponse =
> response.getSpellCheckResponse();
> if (!spellCheckResponse.isCorrectlySpelled()) {
> for (Suggestion suggestion :
> response.getSpellCheckResponse().getSuggestions()) {
>System.out.println("original token: " + suggestion.getToken() + "
> - alternatives: " + suggestion.getAlternatives());
> }
> }
> } catch (Exception e) {
> // TODO Auto-generated catch block
> e.printStackTrace();
> }
>
> Questions:
>
> 1. How do I make the database connection with my DB and search the content to
> see if there are any words that could match?

You can either write SolrJ code to index data into Solr or you can use
DataImportHandler.

http://wiki.apache.org/solr/DIHQuickStart
http://wiki.apache.org/solr/DataImportHandler
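For the MySQL case, the DataImportHandler needs a data-config.xml along these lines; the table and column names below are made up for illustration:

```xml
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="dbuser" password="dbpass"/>
  <document>
    <entity name="word" query="SELECT id, word FROM dictionary">
      <field column="id" name="id"/>
      <field column="word" name="word"/>
    </entity>
  </document>
</dataConfig>
```

The MySQL JDBC driver jar must also be on Solr's classpath (e.g. in the lib directory).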

> 2. How do I create the configuration (solrconfig.xml, schema.xml, etc.)?

You must first edit the schema.xml according to your data. See
https://cwiki.apache.org/confluence/display/solr/Documents%2C+Fields%2C+and+Schema+Design

> 3. How do I send a string from my view (xhtml) so that the Solr server knows
> what it should look for?

For search, you can use the SolrJ java client.

https://cwiki.apache.org/confluence/display/solr/Searching
http://wiki.apache.org/solr/Solrj#Reading_Data_from_Solr

You seem to have done your homework and have found most of the
resources. We will be able to help you in a better way if you asked
specific questions instead.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Questions about Replication Factor on solrcloud

2013-09-06 Thread Shalin Shekhar Mangar
Comments inline:

On Wed, Sep 4, 2013 at 10:38 PM, Lisandro Montaño
 wrote:
> Hi all,
>
>
>
> I’m currently working on deploying a solrcloud distribution on CentOS
> machines and wanted to have more guidance about Replication Factor
> configuration.
>
>
>
> I have configured two servers with solrcloud over tomcat and a third server
> as zookeeper. I have configured successfully and have one server with
> collection1 available and the other with collection1_Shard1_Replica1.
>

How did you configure them this way? In particular, I'm confused as to
why there is collection1 on the first node and
collection1_Shard1_Replica1 on the other.

>
>
> My questions are:
>
>
>
> -  Can I have 1 shard and 2 replicas on two machines? What are the
> limitations or considerations to define this?

Yes you can have 1 shard and 2 replicas, one each on your two
machines. That is the way it is configured by default. For example,
this can be achieved if you create another collection
(numShards=1&replicationFactor=2) using the collection API.
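A sketch of that call (host, port, and collection name are placeholders):

```
http://localhost:8080/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=2
```

With two live nodes, this places one replica of the single shard on each machine.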

>
> -  How does replica works? (there is not too much info about it)

All replicas (physical shards) are peers who decide on a leader using
ZooKeeper. All updates are routed via the leader who forwards
(versioned) updates to other replicas. A query can be served by any
replica. If a replica goes down, then it will attempt to recover from
the current leader and then start serving requests. If the leader goes
down, then all the other replicas (after waiting for a certain time
for the old leader to come back) decide on a new leader.

>
> -  When I import data on collection1 it works properly, but when I
> do it in collection1_Shard1_Replica1 it fails. Is that an expected behavior?
> (Maybe if I have a better definition of replica’s I will understand it
> better)
>

Can you describe how it fails? Stack traces or excerpts from the Solr
logs will help.
-- 
Regards,
Shalin Shekhar Mangar.


Regarding improving performance of the solr

2013-09-06 Thread prabu palanisamy
Hi

I am currently using Solr 3.5.0 with Java 1.6 and have indexed a Wikipedia
dump (50 GB).
I am searching Solr with text (which is actually Twitter tweets).
Currently it takes an average of 210 milliseconds for each post, of which
200 milliseconds are consumed by the Solr server (QTime).  I used the
jconsole monitoring tool.

The stats are
   Heap usage - 10-50Mb,
   No of threads - 10-20
   No of classes - 3800,
   Cpu usage - 10-15%

Currently I am loading all the fields of the Wikipedia index.

I only need the freebase category and wikipedia category. I want to know
how to optimize the Solr server to improve the performance.

Could you please help me out in optimizing the performance?

Thanks and Regards
Prabu


Re: Loading a SpellCheck dynamically

2013-09-06 Thread Shalin Shekhar Mangar
My guess is that you have a single request handler defined with all
your language specific spell check components. This is why you see
spellcheck values from all spellcheckers.

If the above is true, then I don't think there is a way to choose one
specific spellchecker component. The alternative is to define multiple
request handlers with one-to-one mapping with the spell check
components. Then you can send a request to one particular request
handler and the corresponding spell check component will return its
response.
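A sketch of that one-to-one mapping in solrconfig.xml, assuming the English spell check component is registered under the name spellcheck_en:

```xml
<requestHandler name="/spell_en" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <str name="spellcheck.dictionary">default</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck_en</str>
  </arr>
</requestHandler>
```

Queries sent to /spell_en then only ever see suggestions from the English spellchecker; a /spell_fr or /spell_ja handler would be defined the same way.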

On Thu, Sep 5, 2013 at 11:29 PM, Mr Havercamp  wrote:
> I currently have multiple spellchecks configured in my solrconfig.xml to
> handle a variety of different spell suggestions in different languages.
>
> In the snippet below, I have a catch-all spellcheck as well as an English
> only one for more accurate matching (I.e. my schema.xml is set up to capture
> english only fields to an english-specific textSpell_en field and then I
> also capture to a generic textSpell field):
>
> ---solrconfig.xml---
>
> <searchComponent name="spellcheck_en" class="solr.SpellCheckComponent">
>   <str name="queryAnalyzerFieldType">textSpell_en</str>
>
>   <lst name="spellchecker">
>     <str name="name">default</str>
>     <str name="field">spell_en</str>
>     <str name="spellcheckIndexDir">./spellchecker_en</str>
>     <str name="buildOnCommit">true</str>
>   </lst>
> </searchComponent>
>
> <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
>   <str name="queryAnalyzerFieldType">textSpell</str>
>
>   <lst name="spellchecker">
>     <str name="name">default</str>
>     <str name="field">spell</str>
>     <str name="spellcheckIndexDir">./spellchecker</str>
>     <str name="buildOnCommit">true</str>
>   </lst>
> </searchComponent>
>
> My question is; when I query my Solr index, am I able to load, say, just
> spellcheck values from the spellcheck_en spellchecker rather than from both?
> This would be useful if I were to start implementing additional language
> spellchecks; E.g. spellcheck_ja, spellcheck_fr, etc.
>
> Thanks for any insights.
>
> Cheers
>
>
> Hayden



-- 
Regards,
Shalin Shekhar Mangar.


monitoring Solr RAM with graphite

2013-09-06 Thread Dmitry Kan
Hello!

I remember some time ago people were interested in how Solr instances can
be monitored with graphite. This blog post gives a hands-on example from my
experience of monitoring RAM usage of Solr.

http://dmitrykan.blogspot.fi/2013/09/monitoring-solr-with-graphite-and-carbon.html

Please note that this is not Solr-native monitoring, i.e. Solr is treated more
like a black box. It can still satisfy a persistent monitoring need.

Further stats can be added with querying SOLR for cache usage and so on.

Regards,

Dmitry Kan


Regarding reducing qtime

2013-09-06 Thread prabu palanisamy
Hi

I am currently using Solr 3.5.0 with Java 1.6 and have indexed a Wikipedia
dump (50 GB). I am searching the tweets in Solr. Currently it takes an average
of 210 milliseconds for each post, of which 200 milliseconds are consumed by
the Solr server (QTime).  I used the jconsole monitor tool. The stats are:
   heap usage of 10-50Mb,
   No of threads - 10-20
   No of classes around 3800,


Re: Odd behavior after adding an additional core.

2013-09-06 Thread Shalin Shekhar Mangar
Can you give exact steps to reproduce this problem?

Also, are you sure you supplied numShards=4 while creating the collection?

On Fri, Sep 6, 2013 at 12:20 AM, mike st. john  wrote:
> Using Solr 4.4, I used the collection admin to create a collection: 4 shards,
> replication factor of 1.
>
> I did this so I could index my data, then bring in replicas later by adding
> cores via coreadmin.
>
>
> I added a new core via coreadmin. What I noticed shortly after adding the
> core: the leader of the shard where the new replica was placed was marked
> active, the new core was marked as the leader, and the routing was now set
> to implicit.
>
>
>
> i've replicated this on another solr setup as well.
>
>
> Any ideas?
>
>
> Thanks
>
> msj



-- 
Regards,
Shalin Shekhar Mangar.


Re: bucket count for facets

2013-09-06 Thread Shalin Shekhar Mangar
Stats Component can give you a count of non-null values in a field.

See https://cwiki.apache.org/confluence/display/solr/The+Stats+Component

On Fri, Sep 6, 2013 at 12:28 AM, Steven Bower  wrote:
> Is there a way to get the count of buckets (ie unique values) for a field
> facet? the rudimentary approach of course is to get back all buckets, but
> in some cases this is a huge amount of data.
>
> thanks,
>
> steve



-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr documents update on index

2013-09-06 Thread Shalin Shekhar Mangar
Yes, if a document with the same key exists, then the old document
will be deleted and replaced with the new document. You can also
partially update documents (we call it atomic updates) which reads the
old document from local index, updates it according to the request and
then replaces the old document with the new one.

See 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-UpdatingOnlyPartofaDocument
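As a sketch, an atomic update posted to /update in JSON carries only the changed fields (the field names here are hypothetical; the fields of the document must be stored for this to work):

```json
[
  {
    "id": "feed-entry-42",
    "title": { "set": "Updated headline" },
    "view_count": { "inc": 1 }
  }
]
```

Solr reads the old document, applies the set/inc operations, and rewrites the whole document under the same id.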

On Fri, Sep 6, 2013 at 1:03 AM, Luis Portela Afonso
 wrote:
> Hi,
>
> I'm having a problem when solr indexes.
> It is updating documents already indexed. Is this a normal behavior?
> If a document with the same key already exists is it supposed to be updated?
> I was thinking that it is supposed to update only if the information in the
> RSS feed has changed.
>
> Appreciate your help
>
> --
> Sent from Gmail Mobile



-- 
Regards,
Shalin Shekhar Mangar.


Re: solrcloud shards backup/restoration

2013-09-06 Thread Shalin Shekhar Mangar
The replication handler's backup command was built for pre-SolrCloud.
It takes a snapshot of the index but it is unaware of the transaction
log which is a key component in SolrCloud. Hence unless you stop
updates, commit your changes and then take a backup, you will likely
miss some updates.
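For context, the snapshot mentioned above is taken per core through the replication handler, along these lines (host, core, and path are placeholders):

```
http://localhost:8983/solr/collection1/replication?command=backup&location=/path/to/backups
```

The snapshot lands in a timestamped snapshot.* directory under the given location; it covers the index files only, not the transaction log.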

That being said, I'm curious to see how peer sync behaves when you try
to restore from a snapshot. When you say that you haven't been
successful in restoring, what exactly is the behaviour you observed?

On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja  wrote:
> Hello,
>
> I was looking for a good backup / recovery solution for the solrcloud
> indexes. I am more looking for restoring the indexes from the index
> snapshot, which can be taken using the replicationHandler's backup command.
>
> I am looking for something that works with solrcloud 4.3 eventually, but
> still relevant if you tested with a previous version.
>
> I haven't been successful in having the restored index replicate across the
> new replicas after I restart all the nodes, with one node having the
> restored index.
>
> Is restoring the indexes on all the nodes the best way to do it ?
> --
> Regards,
> -Aditya Sakhuja



-- 
Regards,
Shalin Shekhar Mangar.


Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-06 Thread Nutan
I will try this,thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/unknown-stream-source-info-while-indexing-rich-doc-in-solr-tp4088136p4088490.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
The input string is a normal HTML page with the word Zahlungsverkehr in it, and 
my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:

> And show us an input string and a query that fail.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Shawn Heisey
> Sent: Thursday, September 05, 2013 2:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>> I would like to filter/replace a word during indexing, but it doesn't do 
>> anything and I don't get an error.
>> 
>> in schema.xml i have the following:
>> 
>> > multiValued="true"/>
>> 
>> 
>> <analyzer>
>>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>>     pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>   <tokenizer class="solr.KeywordTokenizerFactory"/>
>> </analyzer>
>> 
>> 
>> My second question is: where can I say that the expression is multiline? In 
>> JavaScript I can use /m at the end of the pattern.
> 
> I don't know about your second question.  I don't know if that will be
> possible, but I'll leave that to someone who's more expert than I.
> 
> As for the first question, here's what I have.  Did you reindex?  That
> will be required.
> 
> http://wiki.apache.org/solr/HowToReindex
> 
> Assuming that you did reindex, are you trying to search for ASDFGHJK in
> a field that contains more than just "Zahlungsverkehr"?  The keyword
> tokenizer might not do what you expect - it tokenizes the entire input
> string as a single token, which means that you won't be able to search
> for single words in a multi-word field without wildcards, which are
> pretty slow.
> 
> Note that both the pattern and replacement are case sensitive.  This is
> how regex works.  You haven't used a lowercase filter, which means that
> you won't be able to search for asdfghjk.
> 
> Use the analysis tab in the UI on your core to see what Solr does to
> your field text.
> 
> Thanks,
> Shawn 
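Regarding the second question (the JavaScript-style /m flag): the pattern attribute of PatternReplaceCharFilterFactory is a Java regex, and Java expresses those flags inline at the start of the pattern rather than as a trailing modifier: (?m) for MULTILINE and (?s) for DOTALL. A small stdlib-only sketch of the difference:

```java
public class InlineFlagDemo {
    public static void main(String[] args) {
        String html = "<p>Zahlungs\nverkehr</p>";

        // Without a flag, '.' does not match the newline, so nothing is replaced:
        System.out.println(html.replaceAll("Zahlungs.verkehr", "ASDFGHJK"));

        // (?s) turns on DOTALL (the /s modifier elsewhere): '.' now matches
        // line terminators too, so the pattern can span the line break.
        System.out.println(html.replaceAll("(?s)Zahlungs.verkehr", "ASDFGHJK"));

        // (?m) is the counterpart of JavaScript's /m: it only changes '^' and
        // '$', making them anchor at every line boundary instead of only the
        // ends of the whole input.
        System.out.println("foo\nbar".replaceAll("(?m)^b", "B"));
    }
}
```

So a pattern meant to match across lines of an HTML page would be written with a leading (?s), e.g. pattern="(?s)Zahlungsverkehr", rather than with a trailing /s.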



Re: Solr substring search

2013-09-06 Thread Alvaro Cabrerizo
Hi:

I would start looking:

http://docs.lucidworks.com/display/solr/The+Standard+Query+Parser

And the
org.apache.lucene.queryparser.flexible.standard.StandardQueryParser.java

Hope it helps.

On Thu, Sep 5, 2013 at 11:30 PM, Scott Schneider <
scott_schnei...@symantec.com> wrote:

> Hello,
>
> I'm trying to find out how Solr runs a query for "*foo*".  Google tells me
> that you need to use NGramFilterFactory for that kind of substring search,
> but I find that even with very simple fieldTypes, it just works.  (Perhaps
> because I'm testing on very small data sets, Solr is willing to look
> through all the keywords.)  e.g. This works on the tutorial.
>
> Can someone tell me exactly how this works and/or point me to the Lucene
> code that implements this?
>
> Thanks,
> Scott
>
>
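On larger indexes a leading-wildcard query like *foo* gets slow, because the query parser has to enumerate the term dictionary; the usual alternative is to do the substring expansion at index time with NGramFilterFactory. A schema.xml sketch (the fieldType name and gram sizes are arbitrary choices):

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

At query time a plain term query for "foo" then matches any indexed word containing "foo" as a substring (down to the minGramSize), with no wildcard needed.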