Re: Indexing speed reduced significantly with OCR

2017-03-27 Thread Zheng Lin Edwin Yeo
Yes, the sample documents are not very big. The samples are also a mixture of documents that consist of inline images and documents that are searchable (text extractable without OCR). I suppose only the documents that require OCR will slow down the indexing? Which

RE: Indexing speed reduced significantly with OCR

2017-03-27 Thread Phil Scadden
Only by 10? You must have quite small documents. OCR is an extremely expensive process. Indexing is trivial by comparison. For the quite large documents I am working with, OCR can be 100 times slower than indexing a PDF that is searchable (text extractable without OCR). -Original Message-
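
The arithmetic behind the two figures in this thread can be sketched as follows. This is only a back-of-the-envelope model, not a measurement: the function names, the document counts, and the per-document time are all hypothetical, and the only number taken from the thread is the "100 times slower" ratio.

```python
def estimated_indexing_seconds(n_text_docs, n_ocr_docs, secs_per_text_doc, ocr_slowdown):
    """Rough total indexing time for a corpus that mixes searchable and scanned documents.

    All inputs are illustrative assumptions; ocr_slowdown is how many times
    slower an OCR document is than a searchable one.
    """
    text_time = n_text_docs * secs_per_text_doc
    ocr_time = n_ocr_docs * secs_per_text_doc * ocr_slowdown
    return text_time + ocr_time


def overall_slowdown(n_text_docs, n_ocr_docs, ocr_slowdown):
    """Slowdown of the whole run compared with the same corpus needing no OCR."""
    with_ocr = estimated_indexing_seconds(n_text_docs, n_ocr_docs, 1.0, ocr_slowdown)
    baseline = estimated_indexing_seconds(n_text_docs, n_ocr_docs, 1.0, 1.0)
    return with_ocr / baseline


if __name__ == "__main__":
    # Hypothetical corpus: 900 searchable documents, 100 that need OCR.
    print(overall_slowdown(900, 100, 100))
```

Under these assumed numbers, only a small share of scanned documents is needed for the whole run to slow down by an order of magnitude, which would be consistent with both observations above.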

Indexing speed reduced significantly with OCR

2017-03-27 Thread Zheng Lin Edwin Yeo
Hi, Is the indexing speed of Solr reduced significantly when we use Tesseract OCR to extract scanned inline images from PDFs? I found that after I implemented the solution to extract those scanned images from PDFs, indexing is now roughly 10 times slower. I'm using

Re: OCR not working occasionally

2017-03-27 Thread Zheng Lin Edwin Yeo
I found that this solution from Tim Allison on Stack Overflow works. http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files Regards, Edwin On 19 March 2017 at 19:47, Zheng Lin Edwin Yeo wrote: > These are my settings in the

Closed connection issue while doing dataimport

2017-03-27 Thread santosh sidnal
Hi All, I am facing a closed connection issue while running the data importer. Is there any solution to this? The stack trace is below: [3/27/17 8:54:41:399 CDT] 00b4 OracleDataSto > findMappingClass for : Entry java.sql.SQLRecoverableException: Closed Connection at

Re: AW: Newbie in Solr

2017-03-27 Thread Shawn Heisey
On 3/27/2017 1:35 PM, Ercan Karadeniz wrote: > is my understanding correct that when I use the "managed-schema" file for > the Solr configuration, it is NOT running in schemaless mode? Impossible to say from the info provided. The managed schema is required for schemaless mode, but

Re: Licensing issue advice for Solr.

2017-03-27 Thread Shawn Heisey
On 3/24/2017 11:53 AM, russell.lemas...@comcast.net wrote: > I'm just getting started with Solr (6.4.2) and am trying to get > approval for usage in my workplace. I know that the product in general > is licensed as Apache 2.0, but unfortunately there are packages > included in the build that are

Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
This update seems suspicious; the adds with the same id look like a closure issue in the retry. --- solr1_1 | 2017-03-27 20:19:12.397 INFO (qtp575335780-17) [c:goseg s:shard24 r:core_node12 x:goseg_shard24_replica2] o.a.s.u.p.LogUpdateProcessorFactory [goseg_shard24_replica2] webapp=/solr

Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
Here is the solr log of our test node restarting https://s3.amazonaws.com/uploads.hipchat.com/17705/1138911/fvKS3t5uAnoi0pP/solrlog.txt On Mon, Mar 27, 2017 at 2:10 PM Shawn Feldman wrote: > Ercan, I think you responded to the wrong thread > > On Mon, Mar 27, 2017 at

Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
Ercan, I think you responded to the wrong thread On Mon, Mar 27, 2017 at 2:02 PM Ercan Karadeniz wrote: > 6.4.2 (latest available) or shall I use another one for familiarization > purposes? > > > > From: Alexandre Rafalovitch

AW: losing records during solr updates

2017-03-27 Thread Ercan Karadeniz
6.4.2 (latest available) or shall I use another one for familiarization purposes? From: Alexandre Rafalovitch Sent: Monday, 27 March 2017 21:28 To: solr-user Subject: Re: losing records during solr updates What version of Solr is it?

Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
6.4.2 On Mon, Mar 27, 2017 at 1:29 PM Alexandre Rafalovitch wrote: > What version of Solr is it? > > Regards, >Alex. > > http://www.solr-start.com/ - Resources for Solr users, new and experienced > > > On 27 March 2017 at 15:25, Shawn Feldman

Unexplainable indexing i/o errors

2017-03-27 Thread simon
I'm seeing an odd error during indexing for which I can't find any reason. The relevant solr log entry: 2017-03-24 19:09:35.363 ERROR (commitScheduler-30-thread-1) [ x:build0324] o.a.s.u.CommitTracker auto commit error...:java.io.EOFException: read past EOF:

AW: Newbie in Solr

2017-03-27 Thread Ercan Karadeniz
Hi Alexandre, is my understanding correct that when I use the "managed-schema" file for the Solr configuration, it is NOT running in schemaless mode? Regards, Ercan From: Alexandre Rafalovitch Sent: Friday, 24 March 2017

Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
We are also hard committing at 15 sec and soft committing at 30 sec. I've found that if we change syncLevel to fsync, we don't lose any data. On Mon, Mar 27, 2017 at 1:30 PM Shawn Feldman wrote: > 6.4.2 > > On Mon, Mar 27, 2017 at 1:29 PM Alexandre Rafalovitch

Re: losing records during solr updates

2017-03-27 Thread Alexandre Rafalovitch
What version of Solr is it? Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced On 27 March 2017 at 15:25, Shawn Feldman wrote: > When we restart solr on a leader node while we are doing updates, we've > noticed that some

losing records during solr updates

2017-03-27 Thread Shawn Feldman
When we restart Solr on a leader node while we are doing updates, we've noticed that a small percentage of data is lost, maybe 9 records out of 1k. Updating using min_rf=3 or full quorum seems to resolve this since our rf = 3. Updates then seem to only succeed when all nodes are back up. Why
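
For readers who have not used min_rf, a minimal sketch of the client side follows. It assumes a Solr base URL of http://localhost:8983/solr (hypothetical), reuses the collection name "goseg" seen in the log excerpt elsewhere in this thread, and assumes the achieved replication factor comes back as "rf" in the response header. As far as I understand it, Solr only reports the achieved factor and does not reject the update, so the check and any retry are left to the client.

```python
from urllib.parse import urlencode


def build_update_url(base_url, collection, min_rf):
    """Build an update URL that asks Solr to report the achieved replication factor."""
    query = urlencode({"min_rf": min_rf, "wt": "json"})
    return f"{base_url}/{collection}/update?{query}"


def achieved_min_rf(response, min_rf):
    """Return True if the parsed JSON response reports rf >= min_rf.

    Assumes the value is reported under responseHeader.rf; a missing value
    is treated as a failure so the caller can decide to retry.
    """
    rf = response.get("responseHeader", {}).get("rf")
    return rf is not None and rf >= min_rf


if __name__ == "__main__":
    # Hypothetical host; the collection name is taken from the thread's log excerpt.
    print(build_update_url("http://localhost:8983/solr", "goseg", 3))
    # Hand-written example response, not real Solr output.
    print(achieved_min_rf({"responseHeader": {"status": 0, "rf": 2}}, 3))
```

A client that gets False back here would typically queue the batch and resend it once the restarted node has rejoined.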

Re: Solr Delete By Id Out of memory issue

2017-03-27 Thread Rohit Kanchan
Thanks for replying, Erick. I have deployed the changes to production, so we will soon find out whether it still causes OOM. As for commits, we auto-commit after 10K docs or 30 secs. If I get time I will try to run a local test to check whether we hit OOM because of the 1K map

Re: Solr Delete By Id Out of memory issue

2017-03-27 Thread Erick Erickson
Rohit: Well, whenever I see something like "I have this custom component..." I immediately want the problem to be demonstrated without that custom component before trying to debug Solr. As Chris explained, we can't clear the 1K entries. It's hard to imagine why keeping the last 1,000 entries

RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
See also: http://stackoverflow.com/a/39792337/6281268 This includes jai. Most importantly: be aware of the licensing implications of using levigo and jai. If they had been Apache 2.0 compatible, we would have included them. Finally, there's a new option (coming out in Tika 1.15) that renders

Re: Index scanned documents

2017-03-27 Thread Zheng Lin Edwin Yeo
I tried this solution from Tim Allison, and it works. http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files Regards, Edwin On 27 March 2017 at 20:07, Allison, Timothy B. wrote: > Please also see: > > https://wiki.apache.org/tika/TikaOCR > > and

Re: Schema API: Modify Unique Key

2017-03-27 Thread Shawn Heisey
On 3/27/2017 7:05 AM, nabil Kouici wrote: > We're going to use Solr in our organization (under test) and we want > to set the primary key through schema API, which is not allowed today. > Is this function planned to be implemented in Solr? If yes, do you > have any idea in which version? Steve

Re: Streaming expressions - Any plans to add one to many fetches to the fetch decorator?

2017-03-27 Thread Joel Bernstein
Yes, one to many fetches will be implemented. At the moment there isn't a workaround that I can think of. If you decide to work on a patch for fetch I'll review the patch. Joel Bernstein http://joelsolr.blogspot.com/ On Mon, Mar 27, 2017 at 2:33 PM, adfel70 wrote: > Any

Re: Multi word synonyms

2017-03-27 Thread Doug Turnbull
Fantastic! On Mon, Mar 27, 2017 at 9:56 AM alessandro.benedetti wrote: > In addition to what Doug has already pointed out, I would like to highlight > this contribution in Solr 6.5.0. > It may seem like a small innocent patch but it actually opens a new world > for one

Re: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.175.190.72:8999/solr/product: Rollback is currently not supported in SolrCloud mode. (SOLR-4895

2017-03-27 Thread Shawn Heisey
On 3/27/2017 4:37 AM, Mikhail Ibraheem wrote: > Any help please? > > -Original Message- > From: Mikhail Ibraheem > Sent: 26 March 2017 10:22 PM > To: solr-user@lucene.apache.org > Subject: > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error > from server at

Re: Multi word synonyms

2017-03-27 Thread alessandro.benedetti
In addition to what Doug has already pointed out, I would like to highlight this contribution in Solr 6.5.0. It may seem like a small innocent patch but it actually opens a new world for one of the most controversial aspects of Solr query parsing: http://issues.apache.org/jira/browse/SOLR-9185

Re: Schema API: Modify Unique Key

2017-03-27 Thread Steve Rowe
Hi Nabil, There is an open JIRA issue to implement this functionality, but I haven’t had a chance to work on it recently: . Consequently, I’m not sure which release will have it. Patches welcome! -- Steve www.lucidworks.com > On Mar 27,

Re: Version upgrading approaches

2017-03-27 Thread alessandro.benedetti
Based on what I have noticed so far, the strongest driver for a migration is an upcoming feature or bug fix. It's usually the only way to convince the business layer in small and mid-size companies that are not tech oriented. In general I would say it is quite important to avoid lagging too much behind (

Streaming expressions - Any plans to add one to many fetches to the fetch decorator?

2017-03-27 Thread adfel70
Any ideas how to work around this with the current streaming capabilities? -- View this message in context: http://lucene.472066.n3.nabble.com/Streaming-expressions-Any-plans-to-add-one-to-many-fetches-to-the-fetch-decorator-tp4326989.html Sent from the Solr - User mailing list archive at

Re: Classify document using bag of words

2017-03-27 Thread alessandro.benedetti
Hi marotosg, John's suggestion will definitely work (I recommend a copyField for that analysis). What happens in your use case if a word is common to more than one bag of words (if that is possible at all in your use case)? Do you expect to get back all the classes? Scored in some way? In

Schema API: Modify Unique Key

2017-03-27 Thread nabil Kouici
Hi All, We're going to use Solr in our organization (under test) and we want to set the primary key through schema API, which is not allowed today. Is this function planned to be implemented in Solr? If yes, do you have any idea in which version? Regards, Nabil.

RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
Please also see: https://wiki.apache.org/tika/TikaOCR and https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR If you have any other questions about Apache Tika and OCR, please feel free to ask on our users list as well: u...@tika.apache.org Cheers, Tim

Version upgrading approaches

2017-03-27 Thread John Blythe
Hi all. New versions of Solr come out on a pretty regular basis. We are currently on 6.0. I'm curious what drives you / your team to run the upgrades when you do. Particular features or patches you're eyeballing? Only concerned with major releases? Some other protocol that is set internally? --

Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-27 Thread Emir Arnautovic
It seems to me that you are looking for Solr's highlighting functionality: https://cwiki.apache.org/confluence/display/solr/Highlighting HTH, Emir On 27.03.2017 09:09, forest_soup wrote: We are going to implement a feature: When opening a document whose body field is already indexed in Solr,
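
A minimal sketch of such a highlighting request is below. The hl and hl.fl parameters are standard Solr highlighting parameters; the base URL, the collection name "docs", and the helper name are hypothetical, and "body" is the field name mentioned in the question. The keyword is passed through unescaped, so real code would need to escape Solr query syntax first.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, collection, keyword, field="body"):
    """Build a /select URL that searches one field and asks Solr to highlight matches in it."""
    params = {
        "q": f"{field}:{keyword}",  # search the indexed body field for the keyword
        "hl": "true",               # turn highlighting on
        "hl.fl": field,             # field to generate highlighted snippets for
        "wt": "json",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and collection name.
    print(build_highlight_query("http://localhost:8983/solr", "docs", "solr"))
```

The response then carries a separate highlighting section keyed by document id, which the viewer can use instead of reading raw positions and offsets from the index.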

RE: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.175.190.72:8999/solr/product: Rollback is currently not supported in SolrCloud mode. (SOLR-4895

2017-03-27 Thread Mikhail Ibraheem
Any help please? -Original Message- From: Mikhail Ibraheem Sent: 26 March 2017 10:22 PM To: solr-user@lucene.apache.org Subject: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.175.190.72:8999/solr/product: Rollback is currently not

[ANNOUNCE] Apache Solr 6.5.0 released

2017-03-27 Thread jim ferenczi
27 March 2017, Apache Solr 6.5.0 available The Lucene PMC is pleased to announce the release of Apache Solr 6.5.0. Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted

Is there a way to retrieve the a term's position/offset in Solr

2017-03-27 Thread forest_soup
We are going to implement a feature: When opening a document whose body field is already indexed in Solr, if we issued a keyword search before opening the doc, highlight the keyword in the opening document. That needs the position/offset info of the keyword in the doc's index, which I think

AW: Newbie in Solr

2017-03-27 Thread Ercan Karadeniz
Hi Alexandre, thanks for your response. I will check the provided URLs. Probably I will bother you with questions. Cheers, Ercan From: Alexandre Rafalovitch Sent: Friday, 24 March 2017 01:00 To: solr-user Subject: Re: Newbie in