StoredField
Is there any way to get a BytesRef from a field originally stored as a String? I am playing with Sorter to implement a StoredDocSorter, analogous to NumericDocValuesSorter, but realised I do not need the BytesRef -> String conversion just to compare fields (byte order would be as good for sorting):

    StoredDocument d1 = reader.document(docID1, fieldNamesSet);
    String value1 = d1.get(fieldName);
    BytesRef value1 = d1.getStringAsBytesValue(fieldName); // would love to have it

I need the String type in other places, so indexing as byte[] would be too much hassle. String is internally stored as byte[]; is there any reason not to expose it for StoredField (or any other type)?
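The "byte order would be as good for sorting" claim holds for BMP text: UTF-8 bytes compared as unsigned values sort in code point order, which agrees with String.compareTo for strings without supplementary characters. A quick standalone check of that assumption (plain Java, not a Lucene API):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

public class ByteOrderCheck {
    // Compare two byte[] lexicographically as unsigned bytes
    // (the same ordering a raw-bytes comparison such as BytesRef uses).
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String[] words = {"apple", "Banana", "zebra", "Äpfel", "abc", "ab"};
        String[] byStrings = words.clone();
        Arrays.sort(byStrings); // UTF-16 code unit order
        String[] byBytes = words.clone();
        Arrays.sort(byBytes, Comparator.comparing(
                (String s) -> s.getBytes(StandardCharsets.UTF_8),
                ByteOrderCheck::compareUnsigned));
        // For BMP-only strings the two orders agree.
        System.out.println(Arrays.equals(byStrings, byBytes)); // true
    }
}
```

The exception is locale-aware collation: if you need collator order rather than binary order, the byte comparison above is not a substitute.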
Re: StoredField
Shai, was that irony or am I missing something big time? I would like to spare the BytesRef -> String conversion, not to introduce another one back to BytesRef. Simply, for sorting, you do not need this byte[] -> String conversion; the byte representation of the String is perfectly sortable…

On Mar 17, 2013, at 1:53 PM, Shai Erera ser...@gmail.com wrote:

You can do new BytesRef(d1.get(fieldName)). Shai
Re: StoredField
Sure, there is a way to make anything -> byte[] ;) It looks like this byte[] -> type conversion is done deep down, and this visitor user API already gets the correct types… Maybe an idea would be to delay the byte[] -> type conversion to field access time; I do not know what mines would be on the road to do it. Use cases that require identity checks, or non-locale-specific sorting and co., would benefit from having raw, serialised representations without type conversion… Anyhow, I could switch over to byte[] fields completely to do it. Thanks for responding!

On Mar 17, 2013, at 2:24 PM, Shai Erera ser...@gmail.com wrote:

No no, not irony at all. I misunderstood the first time. You wrote "is there any way to get BytesRef from a field originally stored as String?", so I answered the first thing that came to mind :). But I understand the question now -- you say that since the String field is written as byte[] in the file, you want to read the byte[] as they are, without translating them to String, right? I don't know if it's possible. I'd try field.binaryValue(), though looking at the impl it doesn't suggest it will do what you want. Shai
Re: StoredField
Hi Adrien,

I cannot tell if such a thing would make it less or more robust, just thinking aloud :) I am thinking of it as a way to somehow postpone the byte[] -> type conversion to the moment where it is really needed; simply, keep the byte[] around as long as possible. *Theoretically*, this should improve gc() and memory footprint for some types of downstream processing. It all depends on how easy something like that would be. There is already a way to achieve this by using the binary field type… hmmm, maybe some lucene.expert hack to make Lucene think every field is binary would be simple and robust enough? e.g. Visitor.transportOnlySerializedValuesWithoutTypeConversion()

By the way, the trick with TimSort in Sorter worked great. For 1.1 million short documents, the time to sort an unsorted index on a handful of stored fields went from 490 seconds to 380. Congrats and thanks for it! It also improved compression by 12% (very small, 4k chunk size).

On Mar 17, 2013, at 5:26 PM, Adrien Grand jpou...@gmail.com wrote:

I understand that it is frustrating to perform a String -> byte[] conversion if Lucene just did the opposite. But because it needs to perform one random seek per document (on a file which is often large), the stored fields API is much slower than a String -> UTF-8 bytes conversion, so I think we should keep the API robust rather than allow for these kinds of optimizations.
-- Adrien

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
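The compression win from sorting is easy to reproduce outside Lucene: when near-duplicate records are compressed in small chunks, sorting packs the duplicates into the same chunk, so each chunk deflates much better. A standalone sketch with java.util.zip.Deflater, using 4k chunks as in the experiment above (the record layout and all names are made up for illustration, this is not Lucene's codec):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.zip.Deflater;

public class SortedChunkCompression {
    static final int CHUNK = 4096; // compress in ~4k chunks

    // A deterministic ~90-byte payload per key, so records sharing a key are near-duplicates.
    static String payload(int key) {
        Random r = new Random(key);
        StringBuilder sb = new StringBuilder("key-").append(key).append(':');
        for (int i = 0; i < 80; i++) sb.append((char) ('a' + r.nextInt(26)));
        return sb.toString();
    }

    // Total deflated size when records are concatenated and compressed chunk by chunk.
    public static long chunkedCompressedSize(List<String> records) {
        StringBuilder sb = new StringBuilder();
        for (String r : records) sb.append(r).append('\n');
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
        long total = 0;
        byte[] out = new byte[CHUNK + 64];
        for (int off = 0; off < data.length; off += CHUNK) {
            Deflater d = new Deflater();
            d.setInput(data, off, Math.min(CHUNK, data.length - off));
            d.finish();
            while (!d.finished()) total += d.deflate(out);
            d.end();
        }
        return total;
    }

    // Returns {size in random order, size in sorted order}.
    public static long[] compare() {
        Random rnd = new Random(42);
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 20000; i++) {
            records.add(payload(rnd.nextInt(500)) + ",doc-" + i);
        }
        List<String> sorted = new ArrayList<>(records);
        Collections.sort(sorted); // groups records with the same key together
        return new long[] { chunkedCompressedSize(records), chunkedCompressedSize(sorted) };
    }

    public static void main(String[] args) {
        long[] sizes = compare();
        System.out.println("random order: " + sizes[0] + " bytes, sorted: " + sizes[1] + " bytes");
    }
}
```

In random order a 4k chunk sees mostly distinct payloads and barely compresses; after sorting, a chunk contains many copies of the same payload and shrinks dramatically.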
Re: [jira] [Commented] (LUCENE-3918) Port index sorter to trunk APIs
humpf, do we actually need stored fields? What is wrong with having a byte[] DV that stores all document fields, e.g. Avro or something simpler to serialise all document fields into one byte[]? I am definitely missing something about the DV / stored fields difference, not sure what?

On Mar 6, 2013, at 8:18 PM, Andrzej Bialecki (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/LUCENE-3918 ]

Andrzej Bialecki commented on LUCENE-3918:

bq. I still don't get why someone would use stored fields rather than doc values (either binary, sorted or numeric) to sort his index. I think it's important to make users understand that stored fields are only useful to display results.

This is a legacy of the original usage of this tool in Nutch - indexes would use a PageRank value as a document boost, and that was the value to be used for sorting - but since the doc boost is not recoverable from an existing index, the value itself was stored in a stored field. And definitely DV didn't exist yet at that time :)

Port index sorter to trunk APIs

Key: LUCENE-3918
URL: https://issues.apache.org/jira/browse/LUCENE-3918
Project: Lucene - Core
Issue Type: Task
Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
Fix For: 4.2, 5.0
Attachments: LUCENE-3918.patch (13 revisions)

LUCENE-2482 added an IndexSorter to 3.x, but we need to port this functionality to 4.0 apis.

-- This message is automatically generated by JIRA.
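The "serialise all document fields into one byte[]" idea needs nothing fancier than a length-prefixed encoding. A minimal sketch (no Avro; the class and method names here are invented for illustration, and the resulting byte[] would go into something like a binary doc values field):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldPacker {
    // Serialise field name/value pairs into one byte[].
    public static byte[] pack(Map<String, String> fields) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(fields.size());           // field count
            for (Map.Entry<String, String> e : fields.entrySet()) {
                out.writeUTF(e.getKey());          // length-prefixed name
                out.writeUTF(e.getValue());        // length-prefixed value
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);      // in-memory streams do not actually throw
        }
    }

    public static Map<String, String> unpack(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            Map<String, String> fields = new LinkedHashMap<>();
            int n = in.readInt();
            for (int i = 0; i < n; i++) fields.put(in.readUTF(), in.readUTF());
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "hello");
        doc.put("body", "world");
        System.out.println(unpack(pack(doc))); // {title=hello, body=world}
    }
}
```

The caveat raised in the thread still applies: doc values are a per-field column store, so packing everything into one binary value trades away single-field access.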
Win7 64bit, jvm 7 and MMAP OOM
Just to share some experience if someone hits the same problem. We had huge problems on Win7 64bit, JVM 64bit 1.7.0_07 (a few days old trunk version, 5.0, default codec), Solr under Tomcat, thread queue limited to 20. NRTCaching and MMAP have the same problems (no updates, just search). The unmap hack was on.

Problem: after hitting it for 15 minutes with 10 search threads, gc() did not manage to catch up and free enough memory, and the server repeatedly spiralled to OOM (giving more max heap did not help either). The server is running on 8 cores, and 10 client threads are normally not an issue.

Observations:
- Tweaking JVM memory and gc() options did not help at all.
- Exactly the same configuration and tests on 3 Linux flavours had absolutely no problems.
- Windows using FSDirectory works slower, but stable.
- When the OOM spiralling happens, the major culprits, by occupied memory and no. of instances, are:
  o.a.l.util.WeakIdentityMap$IdentityWeakReference
  java.util.concurrent.ConcurrentHashMap$HashEntry
  java.util.concurrent.ConcurrentHashMap$HashEntry[]
- If search requests are paused for a really long time (5 to 10 minutes!), these references eventually get released.

I know java+MMAP on Windows platforms has problems (slowly releasing mapped regions), but I did not expect it to be that bad, to the point of being useless. It is not an itch currently, all our production is on Linux, but if someone has an idea how to work around it, we would be glad to try it.
Re: [jira] [Commented] (SOLR-4032) Unable to replicate between nodes ( read past EOF)
thanks Mark!

On Nov 27, 2012, at 8:43 PM, Mark Miller (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/SOLR-4032 ]

Mark Miller commented on SOLR-4032:

I'll try and make a fix soon.

Unable to replicate between nodes (read past EOF)

Key: SOLR-4032
URL: https://issues.apache.org/jira/browse/SOLR-4032
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.0
Environment: 5.0-SNAPSHOT 1366361:1404534M - markus - 2012-11-01 12:37:38, Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
Reporter: Markus Jelsma
Assignee: Mark Miller
Fix For: 4.1, 5.0

Please see: http://lucene.472066.n3.nabble.com/trunk-is-unable-to-replicate-between-nodes-Unable-to-download-completely-td4017049.html and http://lucene.472066.n3.nabble.com/Possible-memory-leak-in-recovery-td4017833.html

-- This message is automatically generated by JIRA.
Re: Compressed stored fields and multiGet(sorted luceneId[])?
Thanks everybody for the responses, and much more of the same for the great project!

On Nov 8, 2012, at 11:30 AM, Robert Muir rcm...@gmail.com wrote:

Why are you retrieving thousands of stored fields?

I do not think it is all that rare that people actually do something with the information other than display summaries. Clustering in Solr does exactly that; online record linkage follows exactly the same pattern. A pattern of "fetch thousands of candidates and run some heavy processing on them" is surely not typical web search engine usage, but philosophically, a model of a) search data, b) do something with it, c) deliver is not that strange? You say b) should not be done using stored fields; ok, I trust you, but going to a database/nosql/anything else is even slower. What approach would you recommend?

the probability of two documents of the same results page being in the same chunk is very low.

Adrien, Robert, this is 100% correct, no objection there. In this particular case we are using locality of reference heavily: we simply sort the data and reindex from time to time. You have to be lucky to be able to sort the documents, but we do not use Lucene for big chunks of text, rather for almost fully structured data, and we know how to sort this data to preserve locality of reference… Also a bit unusual, but I do not think an all that rare scenario. Sorting data (where possible) was a great optimisation tip for many applications, even before compression.

really you should roll your own codec for this and specialise.

Yes, I have already started thinking about it, but we will first try to play with the chunk size to see if we can achieve the goal without our own codec…
Compressed stored fields and multiGet(sorted luceneId[])?
Just a theoretical question: would it make sense to add some sort of

    StoredDocument[] bulkGet(int[] docIds)

to fetch multiple stored documents in one go? The reasoning behind it is that now, with compressed blocks, random access gets more expensive, and in some cases a user needs to fetch many documents at once. If it happens that several documents come from one block, it is a win. I would also assume that, even without compression, bulk access on sorted docIds could be a win (sequential access)? Does that make sense, is it doable? Or even worse, does it already exist :)

By the way, I am impressed how well compression does; even on really short stored documents, approx. 150 bytes, we observe a 35% reduction. Fetching 1000 short documents on a fully cached index is observably slower (2-3 times), but as soon as memory gets low, compression wins quickly. Did not test it thoroughly, but it looks good so far. Great job!
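A bulk API like the one proposed could sort the requested ids and group them by chunk, so each compressed chunk is decoded only once. A standalone sketch of just the grouping step (DOCS_PER_CHUNK and all names are hypothetical, not Lucene internals):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BulkGetSketch {
    static final int DOCS_PER_CHUNK = 16; // hypothetical: docs stored per compressed chunk

    // Group requested docIds by chunk so each chunk would be decompressed only once.
    public static Map<Integer, List<Integer>> groupByChunk(int[] docIds) {
        int[] sorted = docIds.clone();
        Arrays.sort(sorted); // sequential access order
        Map<Integer, List<Integer>> byChunk = new LinkedHashMap<>();
        for (int docId : sorted) {
            byChunk.computeIfAbsent(docId / DOCS_PER_CHUNK, k -> new ArrayList<>()).add(docId);
        }
        return byChunk;
    }

    public static void main(String[] args) {
        int[] request = {130, 3, 17, 1, 131, 18};
        System.out.println(groupByChunk(request)); // prints {0=[1, 3], 1=[17, 18], 8=[130, 131]}
    }
}
```

The caller would then decompress chunk 0 once and extract docs 1 and 3 from it, and so on, instead of paying one seek-and-decompress per document.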
Re: java.lang.NoSuchMethodError: org.apache.lucene.search.Scorer.freq()
I *think* I did. Will try to build again...

On Nov 2, 2012, at 1:12 PM, Simon Willnauer simon.willna...@gmail.com wrote:

did you clean your checkout? simon

On Fri, Nov 2, 2012 at 1:10 PM, eksdev eks...@googlemail.com wrote:

debugQuery=true on a simple TermQuery (/trunk version from yesterday) throws an exception on explain(). It seems some of the scorers do not play nicely with explain()?

java.lang.NoSuchMethodError: org.apache.lucene.search.Scorer.freq()F
	at org.apache.lucene.search.TermQuery$TermWeight.explain(TermQuery.java:119)
	at org.apache.lucene.search.IndexSearcher.explain(IndexSearcher.java:636)
	at org.apache.lucene.search.IndexSearcher.explain(IndexSearcher.java:616)
	at org.apache.solr.search.SolrIndexSearcher.explain(SolrIndexSearcher.java:1949)
	at org.apache.solr.util.SolrPluginUtils.getExplanations(SolrPluginUtils.java:352)
	at …
Re: java.lang.NoSuchMethodError: org.apache.lucene.search.Scorer.freq()
clean did it. Never use Eclipse if you want to know exactly what you did; the command line always works. Sorry for the noise.

On Nov 2, 2012, at 1:15 PM, eksdev eks...@googlemail.com wrote:

I *think* I did. Will try to build again...