[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094364#comment-13094364 ]

Simon Willnauer commented on LUCENE-2308:
-----------------------------------------

Hey guys, why don't we put plain old immutable Java objects with a single ctor into core and add a builder API into modules / sandbox? This keeps things simple in core, and users who want the builder can grab it from a module.

bq. Can we avoid the builder API? I think we shouldn't invite accidental creation of lots of FieldType instances during indexing... why not just a single ctor in fieldtype that takes all the parameters the base class cares about? then it serves double-duty as the 'expert' fieldtype anyway, subclasses like TextField are just the sugar.

So far I haven't seen a single technical argument against a builder here. I personally think that a builder has many advantages:

* simple to add new fields; no deprecation is needed if you add another field to a type
* simple to use, since lots of people are used to chaining
* provides immutability by design
* represents a small but clear DSL to build a field type. You could do things like providing setters for TV only if you chain it with a call to indexed(), like:
{code}
builder.indexed().storeTV();
{code}
which would not be visible otherwise.
* a ctor call will require many parameters that you don't want to set, but you're forced to pass a value for them anyway
* since most of the parameters are booleans, long sequences of identically typed parameters can cause subtle bugs. If the user accidentally reverses two such parameters, the compiler won't complain, but the program will misbehave at runtime. That sucks, especially if you spend hours indexing and then realize that your TV has not been stored because you forgot to set indexed = true
* builder code is easy to write and, more importantly, to read.
* a builder simulates named optional parameters, as in Python and other languages, which Java lacks.

I think the Builder pattern is a good choice when designing classes whose constructors would have more than a handful of parameters, especially if most of those parameters are optional. Client code is much easier to read and write with builders than with the traditional telescoping constructor pattern.

Separately specify a field's type
---------------------------------

Key: LUCENE-2308
URL: https://issues.apache.org/jira/browse/LUCENE-2308
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Reporter: Michael McCandless
Assignee: Michael McCandless
Labels: gsoc2011, lucene-gsoc-11, mentor
Fix For: 4.0
Attachments: LUCENE-2308-10.patch, LUCENE-2308-11.patch, LUCENE-2308-12.patch, LUCENE-2308-13.patch, LUCENE-2308-14.patch, LUCENE-2308-15.patch, LUCENE-2308-16.patch, LUCENE-2308-17.patch, LUCENE-2308-18.patch, LUCENE-2308-19.patch, LUCENE-2308-2.patch, LUCENE-2308-20.patch, LUCENE-2308-21.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-branch.patch, LUCENE-2308-final.patch, LUCENE-2308-ltc.patch, LUCENE-2308-merge-1.patch, LUCENE-2308-merge-2.patch, LUCENE-2308-merge-3.patch, LUCENE-2308.branchdiffs, LUCENE-2308.branchdiffs.moved, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch

This came up from discussions on IRC. I'm summarizing here...

Today when you make a Field to add to a document you can set things like index or not, stored or not, analyzed or not, details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc.

I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value.

We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper).

This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index.

This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail:
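The chained builder.indexed().storeTV() idea from the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: SketchFieldType, Builder, IndexedBuilder and every method name here are hypothetical and are not taken from any LUCENE-2308 patch. The point is that storeTV() only becomes reachable after indexed() has been called, while the resulting type object stays immutable and keeps a single constructor.

```java
/** Hypothetical immutable field type; NOT Lucene's actual FieldType API. */
final class SketchFieldType {
  final boolean stored;
  final boolean indexed;
  final boolean termVectors;
  final boolean omitNorms;

  // The single "expert" constructor that could live in core.
  SketchFieldType(boolean stored, boolean indexed, boolean termVectors, boolean omitNorms) {
    if (termVectors && !indexed) {
      throw new IllegalArgumentException("term vectors require an indexed field");
    }
    this.stored = stored;
    this.indexed = indexed;
    this.termVectors = termVectors;
    this.omitNorms = omitNorms;
  }

  static Builder builder() {
    return new Builder();
  }

  /** First stage: options that are valid whether or not the field is indexed. */
  static final class Builder {
    private boolean stored;
    private boolean termVectors;
    private boolean omitNorms;

    Builder stored() { stored = true; return this; }

    // Moving to the second stage is what makes storeTV() visible.
    IndexedBuilder indexed() { return new IndexedBuilder(this); }

    SketchFieldType build() { return new SketchFieldType(stored, false, false, false); }
  }

  /** Second stage: options that only make sense for indexed fields. */
  static final class IndexedBuilder {
    private final Builder base;

    IndexedBuilder(Builder base) { this.base = base; }

    IndexedBuilder stored() { base.stored = true; return this; }
    IndexedBuilder storeTV() { base.termVectors = true; return this; }
    IndexedBuilder omitNorms() { base.omitNorms = true; return this; }

    SketchFieldType build() {
      return new SketchFieldType(base.stored, true, base.termVectors, base.omitNorms);
    }
  }

  public static void main(String[] args) {
    SketchFieldType type = SketchFieldType.builder().stored().indexed().storeTV().build();
    System.out.println("indexed=" + type.indexed + " termVectors=" + type.termVectors);
  }
}
```

With this shape, SketchFieldType.builder().storeTV() does not compile, which is the "not visible otherwise" property, and the constructor check still catches the same mistake for callers who bypass the builder.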
new AutomatonQuery(RunAutomaton) ?
At the moment it is not possible (?) to construct an AutomatonQuery from a RunAutomaton. Would it make sense to add this possibility? Is it doable at all?

I have to keep a collection of RunAutomaton-s for other purposes (after-search feature extraction), and it would be handy to feed them directly to AutomatonQuery.

I could as well keep cached AutomatonQuery objects (the field name does not change), but then I would need to get the (Run)Automaton from the Query...

Thanks, eks.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094368#comment-13094368 ]

Simon Willnauer commented on LUCENE-2312:
-----------------------------------------

jason, I will look at this patch soon I hope. Busy times here right now so gimme some time.

thanks

Search on IndexWriter's RAM Buffer
----------------------------------

Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: core/search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
Fix For: Realtime Branch
Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch, LUCENE-2312.patch

In order to offer users near-realtime search without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable.

Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing.

Michael Busch has good suggestions regarding how to handle deletes using max doc ids.
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here:
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094371#comment-13094371 ]

Chris Male commented on LUCENE-2308:
------------------------------------

+1

I couldn't have put it better myself.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094378#comment-13094378 ]

Uwe Schindler commented on LUCENE-2308:
---------------------------------------

+1 I agree, too! I am personally in favour of builder patterns when parameters get beyond 3 or 4, especially if they are simply booleans.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094379#comment-13094379 ]

Uwe Schindler commented on LUCENE-2308:
---------------------------------------

Somehow related, but for the same reasons (too many booleans in ctor), WordDelimiterFilter would also be a candidate for a WordDelimiterFilterBuilder.
[jira] [Created] (LUCENE-3409) NRT reader/writer over RAMDirectory memory leak
NRT reader/writer over RAMDirectory memory leak
-----------------------------------------------

Key: LUCENE-3409
URL: https://issues.apache.org/jira/browse/LUCENE-3409
Project: Lucene - Java
Issue Type: Bug
Components: core/index
Affects Versions: 3.3, 3.0.2
Reporter: tal steier

With an NRT reader/writer, emptying an index using:

writer.deleteAll()
writer.commit()

doesn't release all allocated memory.

For example, the following code will generate a memory leak:

/**
 * Reveals a memory leak in NRT reader/writer<br>
 *
 * The following main() does 10K cycles of:
 * <ul>
 * <li>Add 10K empty documents to index writer</li>
 * <li>commit()</li>
 * <li>open NRT reader over the writer, and immediately close it</li>
 * <li>delete all documents from the writer</li>
 * <li>commit changes to the writer</li>
 * </ul>
 *
 * Running with -Xmx256M results in an OOME after ~2600 cycles
 */
public static void main(String[] args) throws Exception {
  RAMDirectory d = new RAMDirectory();
  IndexWriter w = new IndexWriter(d, new IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer()));
  Document doc = new Document();
  for (int i = 0; i < 10000; i++) {      // 10K cycles
    for (int j = 0; j < 10000; ++j) {    // 10K empty documents per cycle
      w.addDocument(doc);
    }
    w.commit();
    IndexReader.open(w, true).close();   // open an NRT reader and close it right away
    w.deleteAll();
    w.commit();
  }
  w.close();
  d.close();
}
[jira] [Commented] (LUCENE-3409) NRT reader/writer over RAMDirectory memory leak
[ https://issues.apache.org/jira/browse/LUCENE-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094388#comment-13094388 ]

Gilad Barkai commented on LUCENE-3409:
--------------------------------------

This issue is relevant for trunk as well. Please update the Affected versions accordingly.
[jira] [Commented] (LUCENE-3130) Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts
[ https://issues.apache.org/jira/browse/LUCENE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094397#comment-13094397 ]

Jan Høydahl commented on LUCENE-3130:
-------------------------------------

Let's get back to the original issue: we need some way to let the original form of a term have higher weight than the alternative forms generated by analysis (whether those are synonyms, stems, lowercase or what have you).

Is tagging the added tokens with a tokenType, and then enabling the QParsers to act on these tokenTypes, a viable way forward?

Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts
-----------------------------------------------------------------------------------------------

Key: LUCENE-3130
URL: https://issues.apache.org/jira/browse/LUCENE-3130
Project: Lucene - Java
Issue Type: Improvement
Reporter: Hoss Man

A recent thread asked if there was any way to use QueryTime synonyms such that matches on the original term specified by the user would score higher than matches on the synonym.

It occurred to me later that a float Attribute could be set by the SynonymFilter in such situations, and QueryParser could use that float as a boost in the resulting Query. (This would be fairly straightforward for the simple synonyms => BooleanQuery case, but we'd have to decide how to handle the case of synonyms with multiple terms that produce MTPQ, possibly just punt for now.)

Likewise, there may be other TokenFilters that inject artificial tokens at query time where it also might make sense to have a reduced boost factor...

* SynonymFilter
* CommonGramsFilter
* WordDelimiterFilter
* etc...

In all of these cases, the amount of the boost could be configured, and for back compat could default to 1.0 (or null to not set a boost at all).

Furthermore: if we add a new BoostAttrToPayloadAttrFilter that just copied the boost attribute into the payload attribute, these same filters could give penalizing payloads to terms when used at index time.
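The tokenType idea from the comment above could look roughly like the sketch below. It deliberately avoids the real Lucene attribute and query parser classes: Token, boostsFor and the "SYNONYM" type label are hypothetical names used only for illustration. The only behaviour taken from the issue text is that terms without a configured boost default to 1.0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of per-token-type query boosts; not the Lucene analysis API. */
final class TokenTypeBoostSketch {

  /** A term plus the type tag an analysis filter attached to it. */
  static final class Token {
    final String term;
    final String type;

    Token(String term, String type) {
      this.term = term;
      this.type = type;
    }
  }

  /** Maps every term to its boost; types without a configured boost default to 1.0. */
  static Map<String, Float> boostsFor(List<Token> tokens, Map<String, Float> boostByType) {
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (Token token : tokens) {
      boosts.put(token.term, boostByType.getOrDefault(token.type, 1.0f));
    }
    return boosts;
  }

  public static void main(String[] args) {
    List<Token> tokens = Arrays.asList(
        new Token("tv", "word"),             // the form the user typed
        new Token("television", "SYNONYM")); // injected by a synonym filter
    Map<String, Float> boostByType = new HashMap<>();
    boostByType.put("SYNONYM", 0.5f);
    System.out.println(boostsFor(tokens, boostByType));
  }
}
```

A query parser following this approach would then apply each boost to the clause it builds for that term, so the original form outranks the injected ones.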
[jira] [Assigned] (LUCENE-3409) NRT reader/writer over RAMDirectory memory leak
[ https://issues.apache.org/jira/browse/LUCENE-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-3409:
------------------------------------------

Assignee: Michael McCandless
[jira] [Commented] (LUCENE-3408) Remove unnecessary memory barriers in DWPT
[ https://issues.apache.org/jira/browse/LUCENE-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094406#comment-13094406 ]

Michael McCandless commented on LUCENE-3408:
--------------------------------------------

Looks good Simon!

Have you tested perf...? Likely minor but you never know :)

Remove unnecessary memory barriers in DWPT
------------------------------------------

Key: LUCENE-3408
URL: https://issues.apache.org/jira/browse/LUCENE-3408
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
Fix For: 4.0
Attachments: LUCENE-3408.patch

Currently DWPT still uses AtomicLong to count the bytesUsed. Each write access issues an implicit memory barrier, which is totally unnecessary since we are doing everything single-threaded on that level. This might be very minor, but we shouldn't issue unnecessary memory barriers causing processors to lock their instruction pipeline for no reason.
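The change described in the issue can be pictured with the sketch below. It is an assumption-laden illustration, not the DWPT code or the attached patch: BytesUsedSketch, AtomicCounter and PlainCounter are hypothetical names. Both counters produce the same totals; the plain one simply skips the atomic read-modify-write and is therefore only valid when a single thread updates it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for bytesUsed accounting; not the actual DWPT code. */
final class BytesUsedSketch {

  /** Before: every update is an atomic read-modify-write with volatile semantics. */
  static final class AtomicCounter {
    private final AtomicLong bytesUsed = new AtomicLong();

    long add(long delta) { return bytesUsed.addAndGet(delta); }
    long get() { return bytesUsed.get(); }
  }

  /** After: a plain field, safe only because a single thread ever touches it. */
  static final class PlainCounter {
    private long bytesUsed;

    long add(long delta) { return bytesUsed += delta; }
    long get() { return bytesUsed; }
  }

  public static void main(String[] args) {
    AtomicCounter atomic = new AtomicCounter();
    PlainCounter plain = new PlainCounter();
    for (int i = 1; i <= 100; i++) {
      atomic.add(i);
      plain.add(i);
    }
    System.out.println(atomic.get() == plain.get());
  }
}
```

Whether the difference is measurable depends on the workload, which is why the thread defers to a benchmark.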
[jira] [Commented] (SOLR-2723) If you don't choose a shard name for a SolrCore, the system should auto assign shard names.
[ https://issues.apache.org/jira/browse/SOLR-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094407#comment-13094407 ]

Jan Høydahl commented on SOLR-2723:
-----------------------------------

Hmm, should startup sequence determine role? Sceptical, but if this is on first boot only, and only if not choosing a shard name, perhaps...

If you don't choose a shard name for a SolrCore, the system should auto assign shard names.
-------------------------------------------------------------------------------------------

Key: SOLR-2723
URL: https://issues.apache.org/jira/browse/SOLR-2723
Project: Solr
Issue Type: New Feature
Components: SolrCloud
Reporter: Mark Miller
Fix For: 4.0

When you first boot up a node with the collection files to use, you might also pass how many slices you want - if you choose 3 slices, the first 3 nodes that come up would each go to a different slice and get a unique shard name - further nodes that come up would be replicas in each slice and get one of the 3 shard names.
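The assignment rule in the issue description can be sketched as below. The issue only says that the first N nodes each get a unique shard name and that later nodes get "one of" those names; spreading the later nodes round-robin is an assumption made here, as are the class name ShardAssignerSketch and the shard1..shardN naming scheme.

```java
/** Hypothetical round-robin shard namer; not SolrCloud's actual implementation. */
final class ShardAssignerSketch {
  private final int numSlices;
  private int nodesSeen;

  ShardAssignerSketch(int numSlices) {
    if (numSlices <= 0) {
      throw new IllegalArgumentException("numSlices must be positive");
    }
    this.numSlices = numSlices;
  }

  /**
   * Returns the shard name for the next node that comes up. The first numSlices
   * nodes each open a new slice; later nodes become replicas, cycling over slices.
   */
  synchronized String assignNext() {
    String name = "shard" + (nodesSeen % numSlices + 1);
    nodesSeen++;
    return name;
  }

  public static void main(String[] args) {
    ShardAssignerSketch assigner = new ShardAssignerSketch(3);
    for (int node = 0; node < 5; node++) {
      System.out.println("node " + node + " -> " + assigner.assignNext());
    }
  }
}
```

This also makes the concern in the comment concrete: the name a node receives depends purely on the order in which nodes start.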
[jira] [Commented] (LUCENE-3408) Remove unnecessary memory barriers in DWPT
[ https://issues.apache.org/jira/browse/LUCENE-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094409#comment-13094409 ]

Simon Willnauer commented on LUCENE-3408:
-----------------------------------------

no I haven't tested perf yet, I think I will just wait for the nightly benchmark here. I plan to commit this soon.
[jira] [Updated] (LUCENE-3409) NRT reader/writer over RAMDirectory memory leak
[ https://issues.apache.org/jira/browse/LUCENE-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3409:
---------------------------------------

Affects Version/s: 4.0
Fix Version/s: 4.0
               3.4

I found the issue: we are failing to drop pooled readers in IW.deleteAll. I'll commit a fix shortly.
Re: svn commit: r1163568 - in /lucene/dev/trunk/lucene: CHANGES.txt src/java/org/apache/lucene/index/IndexFileDeleter.java src/java/org/apache/lucene/index/IndexWriter.java src/test/org/apache/lucene/
On Wed, Aug 31, 2011 at 12:36 PM, mikemcc...@apache.org wrote:
> Author: mikemccand
> Date: Wed Aug 31 10:36:36 2011
> New Revision: 1163568
>
> URL: http://svn.apache.org/viewvc?rev=1163568&view=rev
> Log:
> LUCENE-3409: drop reader pool from IW.deleteAll
>
> Modified:
>     lucene/dev/trunk/lucene/CHANGES.txt
>     lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java
>     lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexWriter.java
>     lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/TestIndexWriter.java
>
> Modified: lucene/dev/trunk/lucene/CHANGES.txt
> URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/CHANGES.txt?rev=1163568&r1=1163567&r2=1163568&view=diff
> ==============================================================================
> --- lucene/dev/trunk/lucene/CHANGES.txt (original)
> +++ lucene/dev/trunk/lucene/CHANGES.txt Wed Aug 31 10:36:36 2011
> @@ -577,6 +577,10 @@ Bug fixes
>    throw NoSuchDirectoryException when all files written so far have been
>    written to one directory, but the other still has not yet been created on
>    the filesystem. (Robert Muir)
> +
> +* LUCENE-3409: IndexWriter.deleteAll was failing to close pooled NRT
> +  SegmentReaders, leading to unused files accumulating in the
> +  Directory. (tal steier via Mike McCandless)
>
>  New Features
>
> Modified: lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java
> URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java?rev=1163568&r1=1163567&r2=1163568&view=diff
> ==============================================================================
> --- lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java (original)
> +++ lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java Wed Aug 31 10:36:36 2011
> @@ -374,6 +374,10 @@ final class IndexFileDeleter {
>    }
>
>    public void refresh() throws IOException {
> +    // Set to null so that we regenerate the list of pending
> +    // files; else we can accumulate same file more than
> +    // once
> +    deletable = null;
>      refresh(null);
>    }
>
> Modified: lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexWriter.java
> URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexWriter.java?rev=1163568&r1=1163567&r2=1163568&view=diff
> ==============================================================================
> --- lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexWriter.java (original)
> +++ lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexWriter.java Wed Aug 31 10:36:36 2011
> @@ -600,6 +600,23 @@ public class IndexWriter implements Clos
>        drop(info, IOContext.Context.MERGE);
>      }
>
> +    public synchronized void dropAll() throws IOException {
> +      Iterator<Map.Entry<SegmentCacheKey,SegmentReader>> iter = readerMap.entrySet().iterator();
> +      while (iter.hasNext()) {
> +
> +        final Map.Entry<SegmentCacheKey,SegmentReader> ent = iter.next();
> +
> +        SegmentReader sr = ent.getValue();
> +        sr.hasChanges = false;
> +        iter.remove();
> +
> +        // NOTE: it is allowed that this decRef does not
> +        // actually close the SR; this can happen when a
> +        // near real-time reader using this SR is still open
> +        sr.decRef();
> +      }
> +    }
> +

just being a little picky here, can't we simply iterate over the
readerMap.values() and call readerMap.clear() afterwards like

<snip>
for (SegmentReader sr : readerMap.values()) {
  sr.hasChanges = false;
  sr.decRef();
}
readerMap.clear();
</snip>

the iter.remove() call does a key lookup each time which is totally
unnecessary (well this is super minor!) but it looks more readable,
it's less code and slightly more efficient?

simon

>      public synchronized void drop(SegmentInfo info, IOContext.Context context) throws IOException {
>        final SegmentReader sr;
>        if ((sr = readerMap.remove(new SegmentCacheKey(info, context))) != null) {
> @@ -2141,7 +2158,7 @@ public class IndexWriter implements Clos
>        deleter.refresh();
>
>        // Don't bother saving any changes in our segmentInfos
> -      readerPool.clear(null);
> +      readerPool.dropAll();
>
>        // Mark that the index has changed
>        ++changeCount;
> @@ -3698,7 +3715,6 @@ public class IndexWriter implements Clos
>
>        synchronized(this) {
>          deleter.deleteFile(compoundFileName);
> -        deleter.deleteFile(IndexFileNames.segmentFileName(mergedName, "", IndexFileNames.COMPOUND_FILE_ENTRIES_EXTENSION));
>          deleter.deleteNewFiles(merge.info.files());
>        }
>
> Modified: lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/TestIndexWriter.java
> URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/TestIndexWriter.java?rev=1163568&r1=1163567&r2=1163568&view=diff
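The two shapes of dropAll() discussed in the review above can be compared side by side in a small standalone sketch. This is not the Lucene code: PooledReaderSketch and ReaderPoolSketch are hypothetical stand-ins, String keys replace SegmentCacheKey, and decRef() here only decrements a counter.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for SegmentReader; only the two members the
// commit touches (hasChanges and decRef) are modelled.
class PooledReaderSketch {
  boolean hasChanges = true;
  int refCount = 1;

  void decRef() {
    refCount--;
  }
}

class ReaderPoolSketch {
  // String keys stand in for SegmentCacheKey.
  final Map<String, PooledReaderSketch> readerMap = new HashMap<String, PooledReaderSketch>();

  // Shape of the committed code: remove each entry through the iterator.
  synchronized void dropAllWithIterator() {
    Iterator<Map.Entry<String, PooledReaderSketch>> iter = readerMap.entrySet().iterator();
    while (iter.hasNext()) {
      PooledReaderSketch sr = iter.next().getValue();
      sr.hasChanges = false;
      iter.remove();
      sr.decRef();
    }
  }

  // Shape of the suggested code: visit every value, then clear the map once.
  synchronized void dropAllWithClear() {
    for (PooledReaderSketch sr : readerMap.values()) {
      sr.hasChanges = false;
      sr.decRef();
    }
    readerMap.clear();
  }

  public static void main(String[] args) {
    ReaderPoolSketch pool = new ReaderPoolSketch();
    PooledReaderSketch first = new PooledReaderSketch();
    PooledReaderSketch second = new PooledReaderSketch();
    pool.readerMap.put("_0", first);
    pool.readerMap.put("_1", second);

    pool.dropAllWithClear();
    System.out.println("entries left: " + pool.readerMap.size());
    System.out.println("first refCount: " + first.refCount);
  }
}
```

Both variants leave the map empty and every reader decRef'd once. One difference that may be worth weighing: dropAll() is declared `throws IOException` in the diff, so if a decRef() call failed mid-loop, the iterator variant would already have removed the entries it had processed, while the clear() variant would leave every entry in the map.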
[jira] [Resolved] (LUCENE-3408) Remove unnecessary memory barriers in DWPT
[ https://issues.apache.org/jira/browse/LUCENE-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer resolved LUCENE-3408.
Resolution: Fixed

Remove unnecessary memory barriers in DWPT
Key: LUCENE-3408
URL: https://issues.apache.org/jira/browse/LUCENE-3408
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
Fix For: 4.0
Attachments: LUCENE-3408.patch

Currently DWPT still uses an AtomicLong to count bytesUsed. Each write access issues an implicit memory barrier, which is totally unnecessary since we are doing everything single-threaded at that level. This might be very minor, but we shouldn't issue unnecessary memory barriers that cause processors to lock their instruction pipeline for no reason.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
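To illustrate the change described in LUCENE-3408, here is a minimal sketch of the two counter styles. BytesUsedSketch and its method names are hypothetical and are not the DWPT code; the only point is that a plain long gives the same totals as an AtomicLong as long as exactly one thread reads and writes it.

```java
import java.util.concurrent.atomic.AtomicLong;

class BytesUsedSketch {
  // Counter style described in the issue: every update has volatile semantics.
  private final AtomicLong atomicBytesUsed = new AtomicLong();

  // Plain field: correct only while a single thread reads and writes it.
  private long plainBytesUsed;

  void addAtomic(long delta) {
    atomicBytesUsed.addAndGet(delta);
  }

  void addPlain(long delta) {
    plainBytesUsed += delta;
  }

  long atomicTotal() {
    return atomicBytesUsed.get();
  }

  long plainTotal() {
    return plainBytesUsed;
  }

  public static void main(String[] args) {
    BytesUsedSketch counter = new BytesUsedSketch();
    for (int i = 0; i < 3; i++) {
      counter.addAtomic(1024);
      counter.addPlain(1024);
    }
    System.out.println("atomic total: " + counter.atomicTotal());
    System.out.println("plain total:  " + counter.plainTotal());
  }
}
```

The plain field is only a valid replacement under the single-threaded assumption stated in the issue; if another thread ever needed to read the value, it would have to be published through some other synchronization point.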
[jira] [Closed] (SOLR-2694) LogUpdateProcessor not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl closed SOLR-2694.
Resolution: Cannot Reproduce

Closing this for now - could reopen if necessary...

LogUpdateProcessor not thread safe
Key: SOLR-2694
URL: https://issues.apache.org/jira/browse/SOLR-2694
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 1.4.1, 3.1, 3.2, 3.3, 4.0
Reporter: Jan Høydahl

Using the LogUpdateProcessor while feeding from multiple parallel threads does not work, as LogUpdateProcessor is not thread-safe.
Re: svn commit: r1163568 - in /lucene/dev/trunk/lucene: CHANGES.txt src/java/org/apache/lucene/index/IndexFileDeleter.java src/java/org/apache/lucene/index/IndexWriter.java src/test/org/apache/lucene/
Good idea, I'll fix!

Mike McCandless
http://blog.mikemccandless.com

On Wed, Aug 31, 2011 at 6:58 AM, Simon Willnauer simon.willna...@googlemail.com wrote:
> just being a little picky here, can't we simply iterate over the
> readerMap.values() and call readerMap.clear() afterwards like
>
> <snip>
> for (SegmentReader sr : readerMap.values()) {
>   sr.hasChanges = false;
>   sr.decRef();
> }
> readerMap.clear();
> </snip>
>
> the iter.remove() call does a key lookup each time which is totally
> unnecessary (well this is super minor!) but it looks more readable,
> it's less code and slightly more efficient?
>
> simon
Re: new AutomatonQuery(RunAutomaton) ?
On Wed, Aug 31, 2011 at 3:51 AM, eks dev eks...@yahoo.co.uk wrote:
> At the moment it is not possible (?) to construct AutomatonQuery with
> RunAutomaton. Would it make sense to add this possibility? Is it
> doable at all?

It's not doable; we need more information than the RunAutomaton provides, so it alone is not enough.

--
lucidimagination.com
[jira] [Commented] (LUCENE-3130) Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts
[ https://issues.apache.org/jira/browse/LUCENE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094454#comment-13094454 ]

Robert Muir commented on LUCENE-3130:

{quote}
Let's get back to the original issue: we need some way to let the original form of a term have higher weight than the alternative forms generated by analysis (whether those are synonyms, stems, lowercase or what have you).
{quote}

I'm not sure we do! See my last response. I think 2 fields is just fine.

As for things like synonyms, these already set TypeAttribute. So if your consumer wants to do something with synonyms, look for type = SYNONYM or whatever it already sets.

Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts
Key: LUCENE-3130
URL: https://issues.apache.org/jira/browse/LUCENE-3130
Project: Lucene - Java
Issue Type: Improvement
Reporter: Hoss Man

A recent thread asked if there was any way to use query-time synonyms such that matches on the original term specified by the user would score higher than matches on the synonym.

It occurred to me later that a float Attribute could be set by the SynonymFilter in such situations, and QueryParser could use that float as a boost in the resulting Query. (This would be fairly straightforward for the simple synonyms => BooleanQuery case, but we'd have to decide how to handle the case of synonyms with multiple terms that produce MTPQ, possibly just punt for now.)

Likewise, there may be other TokenFilters that inject artificial tokens at query time where it also might make sense to have a reduced boost factor...

* SynonymFilter
* CommonGramsFilter
* WordDelimiterFilter
* etc...

In all of these cases, the amount of the boost could be configured, and for back compat could default to 1.0 (or null to not set a boost at all).

Furthermore: if we add a new BoostAttrToPayloadAttrFilter that just copied the boost attribute into the payload attribute, these same filters (when used at index time) could give penalizing payloads to terms.
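The "look for type = SYNONYM" idea from the comment above could look roughly like the following on the consumer side. This is only a sketch under assumptions: TokenSketch is a hypothetical stand-in for a token carrying a term and a type (it is not Lucene's TypeAttribute API), the "SYNONYM" label is assumed to be what the filter sets, and the 0.5 penalty is an arbitrary illustrative value.

```java
// Hypothetical token: just a term plus the type label a filter assigned to it.
class TokenSketch {
  final String term;
  final String type;

  TokenSketch(String term, String type) {
    this.term = term;
    this.type = type;
  }
}

class TypeBoostSketch {
  // Assumed type label for injected synonyms.
  static final String SYNONYM_TYPE = "SYNONYM";
  // Arbitrary illustrative penalty; original terms keep a boost of 1.0.
  static final float SYNONYM_BOOST = 0.5f;

  // The consumer decides the boost by inspecting the token type,
  // so the filter itself needs no separate boost attribute.
  static float boostFor(TokenSketch token) {
    return SYNONYM_TYPE.equals(token.type) ? SYNONYM_BOOST : 1.0f;
  }

  public static void main(String[] args) {
    TokenSketch original = new TokenSketch("fast", "word");
    TokenSketch synonym = new TokenSketch("quick", SYNONYM_TYPE);
    System.out.println(original.term + " boost: " + boostFor(original));
    System.out.println(synonym.term + " boost: " + boostFor(synonym));
  }
}
```

The design point is that the decision lives in the consumer (for example a query parser subclass), which is what makes a dedicated BoostAttribute optional in this line of argument.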
[jira] [Resolved] (LUCENE-3409) NRT reader/writer over RAMDirectory memory leak
[ https://issues.apache.org/jira/browse/LUCENE-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-3409.
Resolution: Fixed

Thanks tal!

NRT reader/writer over RAMDirectory memory leak
Key: LUCENE-3409
URL: https://issues.apache.org/jira/browse/LUCENE-3409
Project: Lucene - Java
Issue Type: Bug
Components: core/index
Affects Versions: 3.0.2, 3.3, 4.0
Reporter: tal steier
Assignee: Michael McCandless
Fix For: 3.4, 4.0

With NRT reader/writer, emptying an index using:

    writer.deleteAll()
    writer.commit()

doesn't release all allocated memory. For example, the following code will generate a memory leak:

    /**
     * Reveals a memory leak in NRT reader/writer<br>
     *
     * The following main() does 10K cycles of:
     * <ul>
     * <li>Add 10K empty documents to index writer</li>
     * <li>commit()</li>
     * <li>open NRT reader over the writer, and immediately close it</li>
     * <li>delete all documents from the writer</li>
     * <li>commit changes to the writer</li>
     * </ul>
     *
     * Running with -Xmx256M results in an OOME after ~2600 cycles
     */
    public static void main(String[] args) throws Exception {
      RAMDirectory d = new RAMDirectory();
      IndexWriter w = new IndexWriter(d, new IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer()));
      Document doc = new Document();
      for (int i = 0; i < 10000; i++) {
        for (int j = 0; j < 10000; ++j) {
          w.addDocument(doc);
        }
        w.commit();
        IndexReader.open(w, true).close();
        w.deleteAll();
        w.commit();
      }
      w.close();
      d.close();
    }
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094462#comment-13094462 ]

Robert Muir commented on LUCENE-2308:

{quote}
so I haven't seen a single technical argument against a builder here. I personally think that a builder has many advantages:
{quote}

I gave one already: it creates too many objects. It also adds complexity to the API. Just because a constructor has a couple of parameters does *NOT* mean a builder fits. In situations like this one, it's a bad choice.

Separately specify a field's type
Key: LUCENE-2308
URL: https://issues.apache.org/jira/browse/LUCENE-2308
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Reporter: Michael McCandless
Assignee: Michael McCandless
Labels: gsoc2011, lucene-gsoc-11, mentor
Fix For: 4.0
Attachments: LUCENE-2308-10.patch, LUCENE-2308-11.patch, LUCENE-2308-12.patch, LUCENE-2308-13.patch, LUCENE-2308-14.patch, LUCENE-2308-15.patch, LUCENE-2308-16.patch, LUCENE-2308-17.patch, LUCENE-2308-18.patch, LUCENE-2308-19.patch, LUCENE-2308-2.patch, LUCENE-2308-20.patch, LUCENE-2308-21.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-branch.patch, LUCENE-2308-final.patch, LUCENE-2308-ltc.patch, LUCENE-2308-merge-1.patch, LUCENE-2308-merge-2.patch, LUCENE-2308-merge-3.patch, LUCENE-2308.branchdiffs, LUCENE-2308.branchdiffs.moved, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch

This came up from discussions on IRC. I'm summarizing here...

Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, and details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc.

I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value.

We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper).

This would NOT be a schema! It's just refactoring what we already specify today. E.g. it's not serialized into the index.

This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094466#comment-13094466 ]

Chris Male commented on LUCENE-2308:

How does it create too many objects?
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094477#comment-13094477 ]

Robert Muir commented on LUCENE-2308:

we have to realize, most people indexing with lucene do it like this:

{noformat}
while(...) {
  Document doc = new Document(...);
  Field field1 = new Field(...);
  Field field2 = new Field(...);
}
{noformat}

So for MOST people FT is increasing the number of objects being created per-document (most people will create a new one for every field). I think we should keep that at a minimum.

Adding a builder on top, will at minimum require an additional object for the builder itself *AND*:
* creation of a new intermediate throw-away FieldType with *each* .set() OR
* creation of an additional mutable object used internally by the builder which will require keeping in sync with the immutable form.
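The object counts argued over in this comment can be made concrete with a small sketch. FieldTypeSketch, its three flags, and the allocation counter are all hypothetical and exist only for illustration; none of this is the FieldType API under discussion. It compares a single constructor call, copy-on-set immutables, and a mutable builder.

```java
final class FieldTypeSketch {
  // Allocation counter, for illustration only.
  static int created = 0;

  final boolean indexed;
  final boolean stored;
  final boolean termVectors;

  FieldTypeSketch(boolean indexed, boolean stored, boolean termVectors) {
    this.indexed = indexed;
    this.stored = stored;
    this.termVectors = termVectors;
    created++;
  }

  // First option from the comment: each setter returns a fresh immutable copy.
  FieldTypeSketch withIndexed(boolean value) {
    return new FieldTypeSketch(value, stored, termVectors);
  }

  FieldTypeSketch withStored(boolean value) {
    return new FieldTypeSketch(indexed, value, termVectors);
  }

  FieldTypeSketch withTermVectors(boolean value) {
    return new FieldTypeSketch(indexed, stored, value);
  }

  // Second option: a mutable builder that mirrors every field of the
  // immutable form and has to be kept in sync with it.
  static final class Builder {
    private boolean indexed;
    private boolean stored;
    private boolean termVectors;

    Builder indexed() {
      indexed = true;
      return this;
    }

    Builder stored() {
      stored = true;
      return this;
    }

    Builder termVectors() {
      termVectors = true;
      return this;
    }

    FieldTypeSketch build() {
      return new FieldTypeSketch(indexed, stored, termVectors);
    }
  }

  public static void main(String[] args) {
    created = 0;
    new FieldTypeSketch(true, true, true);
    System.out.println("constructor: " + created + " FieldTypeSketch");

    created = 0;
    new FieldTypeSketch(false, false, false).withIndexed(true).withStored(true).withTermVectors(true);
    System.out.println("copy-on-set: " + created + " FieldTypeSketch");

    created = 0;
    new Builder().indexed().stored().termVectors().build();
    System.out.println("mutable builder: " + created + " FieldTypeSketch, plus 1 Builder");
  }
}
```

In this sketch the copy-on-set style allocates one object per call in the chain, while the mutable builder allocates one extra object regardless of chain length, at the price of duplicating the field list. Whether either cost matters depends on how often a type is built, which is the actual point of disagreement in the thread.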
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094476#comment-13094476 ]

Uwe Schindler commented on LUCENE-2308:

bq. How does it create too many objects?

That's an implementation detail. If you want final, unmodifiable objects, every builder call will produce a new one as its return value (see ScorerContext). In general, though, the builder pattern can also change existing objects, like StringBuilder does.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094478#comment-13094478 ]

Robert Muir commented on LUCENE-2308:

{quote}
In general the builder pattern can also change existing objects, like StringBuilder does
{quote}

And that's another bug in the visitor anti-pattern: if you want to have a resulting immutable form, that's going to require either an object-creation orgy or massive code duplication so that it can store an internal mutable form.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094491#comment-13094491 ]

Simon Willnauer commented on LUCENE-2308:

bq. Second time this morning you didn't even read what I said.

I did, but apparently we are talking about different things? The entire purpose of FT is that you don't have to create it multiple times, so folks can create a Field each time but they should reuse the FT, no? I am personally talking about creating the FT using a builder, but what Uwe says is that we could also do that for Field.

Again, how do you create way more objects when you use a builder than when you use the ctor?
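The reuse argument in this comment (a new Field per document, one shared FT) can be sketched as follows. SharedTypeSketch, FieldSketch and IndexingLoopSketch are hypothetical names, the counter exists only to make allocations visible, and the loop is a simplified stand-in for a real indexing loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical immutable field type with an allocation counter.
final class SharedTypeSketch {
  static int typesCreated = 0;

  final boolean indexed;
  final boolean stored;

  SharedTypeSketch(boolean indexed, boolean stored) {
    this.indexed = indexed;
    this.stored = stored;
    typesCreated++;
  }
}

// Hypothetical field: holds the per-document value and points at a type.
final class FieldSketch {
  final String name;
  final String value;
  final SharedTypeSketch type;

  FieldSketch(String name, String value, SharedTypeSketch type) {
    this.name = name;
    this.value = value;
    this.type = type;
  }
}

class IndexingLoopSketch {
  // The type is built once, outside the loop, and shared by every field.
  static List<FieldSketch> withReuse(int docs) {
    SharedTypeSketch type = new SharedTypeSketch(true, true);
    List<FieldSketch> fields = new ArrayList<FieldSketch>();
    for (int i = 0; i < docs; i++) {
      fields.add(new FieldSketch("id", Integer.toString(i), type));
    }
    return fields;
  }

  // The pattern the thread worries about: a fresh type for every field.
  static List<FieldSketch> withoutReuse(int docs) {
    List<FieldSketch> fields = new ArrayList<FieldSketch>();
    for (int i = 0; i < docs; i++) {
      fields.add(new FieldSketch("id", Integer.toString(i), new SharedTypeSketch(true, true)));
    }
    return fields;
  }

  public static void main(String[] args) {
    SharedTypeSketch.typesCreated = 0;
    withReuse(1000);
    System.out.println("types created with reuse: " + SharedTypeSketch.typesCreated);

    SharedTypeSketch.typesCreated = 0;
    withoutReuse(1000);
    System.out.println("types created without reuse: " + SharedTypeSketch.typesCreated);
  }
}
```

With reuse, how the type is constructed (constructor or builder) is paid for once; without reuse, it is paid per field, which is the case where the construction style starts to matter.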
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094492#comment-13094492 ]

Chris Male commented on LUCENE-2308:

I'm confused about what the reuse of Field objects has to do with this? That seems a corollary issue. Aren't we talking about reducing the cost of creating FieldType instances? Which as Simon said, are then shared.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094496#comment-13094496 ]

Robert Muir commented on LUCENE-2308:
-------------------------------------

It's shared only if the person reuses them explicitly, but if they aren't reusing fields (as most people don't), then they aren't likely to reuse FieldTypes either.

In general, I think we shouldn't create so many objects or add so much complexity to the indexing loop. Personally, I just don't think that in practice people are going to set things up so that they actually reuse FieldType.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094499#comment-13094499 ]

Michael McCandless commented on LUCENE-2308:
--------------------------------------------

bq. Change FieldType to an interface inside index.* and use it for the source of properties about an IndexableField.

+1, I think we should have an oal.index.FieldType interface that exposes (get-only) methods. I.e., we'd just move the getters out of IndexableField into this new FT interface (likewise for StorableField). This interface should be marked as experimental, i.e., we are free to change it.

bq. Add a builder for FieldType to document.* which will create FieldType instances.

I don't think we should use a builder API here; I think either a big ctor that takes all settings (so all fields are final) or what we have today (.freeze()) is better. There are two things I don't like about the builder pattern: setter chaining and the object overhead of hard immutability.

On setter chaining:

* It's two ways to do the same thing (chaining or not); generally an API (and a PL) should offer one (obvious) way to do things. Suddenly we'll see tutorials and articles etc. online, some with chaining, some without, and some mixed.
* Code is less readable with chaining: it makes it easy to sneak in multiple statements per line, embed them into other statements, etc., vs. unchained, where you always have one statement per line.
* I don't like .indexed() as a name; I prefer .setIndexed() so it's clear you're setting something about the object.
* It encourages inefficient code, because it's easy to inline new X().this().that() when in fact the app really should create and reuse a FieldType up front. This is trappy: the app doesn't realize it's creating N+1 objects.

I also don't like the hard immutability (every field is final, so every setter returns a new object), since this will mean the typical use is creating tons of objects per field per doc. Yes, we can have a mutable builder with a .build() at the end, but that makes the API even more cumbersome.

In contrast, the soft immutability we have now (freeze) is very effective and creates no additional objects: it will prevent you from altering a FT instance once any Field uses it. Really, the immutability is a minor detail of the implementation here; we only need it to prevent this trap.

Generally we should try to keep Lucene's core APIs as plain/simple/straightforward as possible. Someone can always later layer a builder API on top of the simpler setter+freeze or all-properties-to-ctor API, but not vice versa (efficiently, anyway).
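For readers following the thread, here is a minimal sketch of the "soft immutability" (setter + freeze) approach described in the comment above. It is only an illustration under stated assumptions: the class names SketchFieldType and SketchField, the choice of properties, and the exception type are hypothetical and are not taken from any Lucene patch.

```java
// Hypothetical sketch of setter + freeze ("soft immutability").
// SketchFieldType and SketchField are illustrative names, not Lucene classes.
final class SketchFieldType {
    private boolean indexed;
    private boolean stored;
    private boolean frozen;

    // Marks the type as read-only; setters called afterwards will fail.
    void freeze() {
        frozen = true;
    }

    private void checkIfFrozen() {
        if (frozen) {
            throw new IllegalStateException("FieldType is frozen and can no longer be changed");
        }
    }

    void setIndexed(boolean value) {
        checkIfFrozen();
        indexed = value;
    }

    void setStored(boolean value) {
        checkIfFrozen();
        stored = value;
    }

    boolean isIndexed() { return indexed; }
    boolean isStored() { return stored; }
    boolean isFrozen() { return frozen; }
}

final class SketchField {
    private final String name;
    private final String value;
    private final SketchFieldType type;

    SketchField(String name, String value, SketchFieldType type) {
        // The first field that uses the type freezes it; no copy is made,
        // so many fields can share a single type instance.
        type.freeze();
        this.name = name;
        this.value = value;
        this.type = type;
    }

    String name() { return name; }
    String value() { return value; }
    SketchFieldType type() { return type; }
}

final class FreezeDemo {
    public static void main(String[] args) {
        SketchFieldType type = new SketchFieldType();
        type.setIndexed(true);
        type.setStored(true);

        // Both fields share one type object; nothing else is allocated for it.
        SketchField title = new SketchField("title", "some title", type);
        SketchField author = new SketchField("author", "some author", type);

        try {
            type.setStored(false); // too late: the type is already in use
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println(title.type() == author.type());
    }
}
```

In this sketch the type object is configured once with ordinary setters and then shared; the freeze flag exists only to catch later modification, which matches the "minor detail of the implementation" framing in the comment.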
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094502#comment-13094502 ]

Chris Male commented on LUCENE-2308:
------------------------------------

Didn't Simon suggest we add the big-ctor version to core?

{quote}
why don't we put plain old immutable java objects with a single ctor into core and add a builder API into modules / sandbox?
{quote}

So yes, Lucene's core can stay lean and mean, but we can have the builder in userland / module / sandbox.
[jira] [Commented] (LUCENE-3130) Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts
[ https://issues.apache.org/jira/browse/LUCENE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094505#comment-13094505 ]

Jan Høydahl commented on LUCENE-3130:
-------------------------------------

Robert, two fields work great for supporting stuff like phonetic and stem/non-stem search, and also lower/exact-case search, although index size could be lower with a one-field approach. Let's let those use cases rest for now.

But for the synonym case, what remains is to modify the QueryParser to act on the already-present TypeAttribute, is that so? If so, let's open another issue for that.

Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts
------------------------------------------------------------------------------------------------

Key: LUCENE-3130
URL: https://issues.apache.org/jira/browse/LUCENE-3130
Project: Lucene - Java
Issue Type: Improvement
Reporter: Hoss Man

A recent thread asked if there was any way to use query-time synonyms such that matches on the original term specified by the user would score higher than matches on the synonym.

It occurred to me later that a float Attribute could be set by the SynonymFilter in such situations, and QueryParser could use that float as a boost in the resulting Query. (This would be fairly straightforward for the simple synonyms = BooleanQuery case, but we'd have to decide how to handle the case of synonyms with multiple terms that produce MTPQ, possibly just punt for now.)

Likewise, there may be other TokenFilters that inject artificial tokens at query time where it also might make sense to have a reduced boost factor...

* SynonymFilter
* CommonGramsFilter
* WordDelimiterFilter
* etc...

In all of these cases, the amount of the boost could be configured, and for back compat could default to 1.0 (or null to not set a boost at all).

Furthermore: if we add a new BoostAttrToPayloadAttrFilter that just copied the boost attribute into the payload attribute, these same filters (when used at index time) could give penalizing payloads to terms.
[jira] [Issue Comment Edited] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094502#comment-13094502 ]

Chris Male edited comment on LUCENE-2308 at 8/31/11 1:11 PM:
-------------------------------------------------------------

Didn't Simon suggest we add the big-ctor version to core?

{quote}
why don't we put plain old immutable java objects with a single ctor into core and add a builder API into modules / sandbox?
{quote}

So yes, Lucene's core can stay lean and mean, but we can have the builder in userland / module / sandbox

was (Author: cmale):

Didn't Simon suggest we add the big-ctor version to core?

{quote}
why don't we put plain old immutable java objects with a single ctor into core and add a builder API into modules / sandbox?
{quote}

So yes, Lucene's core can stay lean and mean, but we can have the builder is userland / module / sandbox
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094514#comment-13094514 ]

Uwe Schindler commented on LUCENE-2308:
---------------------------------------

I am on the opposite side: In general the constructor of the immutable class is hidden (package-private or private, depending on the class hierarchy), so nobody can use it. The only API the user sees is the builder pattern. That way we have only *one* API and one usage type. Builder code can be formatted very nicely, and it does not matter whether people write:

{code:java}
Field.Builder b = new Field.Builder();
b.setFoo();
b.setBar();
Field f = b.build();
{code}

versus

{code:java}
Field f = new Field.Builder()
  .setFoo()
  .setBar()
  .build();
{code}

The second, chained form is even more readable, and that is why *I* prefer builders. A so-called telescoping constructor is the antipattern because it is completely unreadable, as Java lacks named parameters (the best example is WordDelimiterFilter, that one is horrible: a typical candidate for a WordDelimiterFilter.Builder subclass).

The chaining code is also more natural than the first form for stack-based machines like the JVM and the x86 processors. The return value of the previous call already resides on the stack after the method returns; instead of popping it and pushing it again, it can stay there and you simply add the parameters of the next method call. This also leads to very elegant bytecode, for which HotSpot has optimizations :-)

About code duplication: In the hidden ctor of the immutable class you can make a clone of the builder and keep it somewhere private final inside the instance. This clone then holds the unmodifiable instance state.

About the number of objects (yes, we have the builder object, possibly a clone of it as suggested above, and finally the immutable object): The concern about the number of objects is really nonsense here, as all of those will be created in the Eden space and disappear as soon as the loop/method exits. You can try autoboxing with a recent Java VM; in most cases you would see no slowdown caused by autoboxing. These are problems from pre-2000, when we had Java 1.1.

Uwe
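The hidden-constructor builder described in the comment above might look roughly like the following sketch. All names here (BuiltFieldType, its Builder, and the individual properties) are hypothetical, and the validation rule in build() is an assumption added only to show where such a check could live. To keep the sketch short, the private constructor copies the builder's values into final fields rather than holding a cloned builder as the comment suggests.

```java
// Hypothetical sketch: an immutable type whose only entry point is a builder.
// BuiltFieldType is an illustrative name, not a Lucene class.
final class BuiltFieldType {
    private final boolean indexed;
    private final boolean stored;
    private final boolean storeTermVectors;

    // Hidden constructor: user code cannot call it directly.
    private BuiltFieldType(Builder b) {
        this.indexed = b.indexed;
        this.stored = b.stored;
        this.storeTermVectors = b.storeTermVectors;
    }

    boolean isIndexed() { return indexed; }
    boolean isStored() { return stored; }
    boolean isStoreTermVectors() { return storeTermVectors; }

    static final class Builder {
        private boolean indexed;
        private boolean stored;
        private boolean storeTermVectors;

        // Each setter returns this, so calls may be chained or written one per line.
        Builder setIndexed(boolean value) { indexed = value; return this; }
        Builder setStored(boolean value) { stored = value; return this; }
        Builder setStoreTermVectors(boolean value) { storeTermVectors = value; return this; }

        BuiltFieldType build() {
            // Assumed rule, for illustration only: term vectors require an indexed field.
            if (storeTermVectors && !indexed) {
                throw new IllegalStateException("term vectors require an indexed field");
            }
            return new BuiltFieldType(this);
        }
    }
}

final class BuilderDemo {
    public static void main(String[] args) {
        // Chained form.
        BuiltFieldType chained = new BuiltFieldType.Builder()
            .setIndexed(true)
            .setStoreTermVectors(true)
            .build();

        // Unchained form, one statement per line.
        BuiltFieldType.Builder b = new BuiltFieldType.Builder();
        b.setIndexed(true);
        b.setStoreTermVectors(true);
        BuiltFieldType unchained = b.build();

        System.out.println(chained.isIndexed() == unchained.isIndexed());
    }
}
```

Both call styles go through the same build() method, so any consistency check runs exactly once, at the point where the immutable object is created.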
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094517#comment-13094517 ]

Simon Willnauer commented on LUCENE-2308:
-----------------------------------------

awesome writeup uwe! thank you!
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094522#comment-13094522 ]

Robert Muir commented on LUCENE-2308:
-------------------------------------

{quote}
In general the constructor of the immutable class is hidden (package-private or private depending on class hierarchy). So nobody can use it. The only API the user sees is the builder pattern.
{quote}

I am strongly against this: there is no reason to do this.

We should instead expose the constructor of the immutable class, so that people who want builders can use them. But if I don't want builders, I shouldn't have to use them. There is no reason for this.
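The alternative raised in the comment above, and earlier in the thread by Simon and Chris (a plain immutable class with a public constructor in core, plus an optional builder layered on top elsewhere), could be sketched as follows. CoreFieldType and CoreFieldTypeBuilder are hypothetical names, and the three boolean properties are chosen only for illustration.

```java
// Hypothetical "core" class: immutable, with a single exposed constructor.
// CoreFieldType is an illustrative name, not a Lucene class.
final class CoreFieldType {
    private final boolean indexed;
    private final boolean stored;
    private final boolean tokenized;

    public CoreFieldType(boolean indexed, boolean stored, boolean tokenized) {
        this.indexed = indexed;
        this.stored = stored;
        this.tokenized = tokenized;
    }

    boolean isIndexed() { return indexed; }
    boolean isStored() { return stored; }
    boolean isTokenized() { return tokenized; }
}

// Hypothetical optional helper that could live outside core (module / sandbox).
// It only calls the exposed constructor, so core does not depend on it.
final class CoreFieldTypeBuilder {
    private boolean indexed;
    private boolean stored;
    private boolean tokenized;

    CoreFieldTypeBuilder setIndexed(boolean value) { indexed = value; return this; }
    CoreFieldTypeBuilder setStored(boolean value) { stored = value; return this; }
    CoreFieldTypeBuilder setTokenized(boolean value) { tokenized = value; return this; }

    CoreFieldType build() {
        return new CoreFieldType(indexed, stored, tokenized);
    }
}

final class LayeredDemo {
    public static void main(String[] args) {
        // Users who do not want a builder call the constructor directly.
        CoreFieldType direct = new CoreFieldType(true, false, true);

        // Users who prefer a builder get the same result through the helper.
        CoreFieldType built = new CoreFieldTypeBuilder()
            .setIndexed(true)
            .setTokenized(true)
            .build();

        System.out.println(direct.isIndexed() == built.isIndexed());
    }
}
```

The trade-off visible in the sketch is the one debated in the thread: the constructor call site shows three positional booleans with no names, while the builder call site names each property but adds a second class and a second way to create the type.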
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094526#comment-13094526 ]

Uwe Schindler commented on LUCENE-2308:
---------------------------------------

If we release the code with the builder pattern then there is only one possibility and one example code in the class description. If somebody does not like the builder pattern, who cares? If there is nothing else, you have to use it. PERIOD.
[jira] [Issue Comment Edited] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094526#comment-13094526 ] Uwe Schindler edited comment on LUCENE-2308 at 8/31/11 1:41 PM: If we release the code with the builder pattern then there is only one possibility and one example code in the class description. If somebody does not like the builder pattern, who cares? If there is nothing else, you have to use it. PERIOD. bq. About code duplication: You can in the hidden ctor of the immutable class make a clone of the builder and keep it somewhere private final inside the instance. This one then holds the unmodifiable instance state. I already explained: The code duplication comes from the two ways to do it. Of course for lovers of telescopic unreadbale methods we can still add some conventional factory methods, taking tons of parameters, but internally use the builder. The user would not see the builder. was (Author: thetaphi): If we release the code with the builder pattern then there is only one possibility and one example code in the class description. If somebody does not like the builder pattern, who cares? If there is nothing else, you have to use it. PERIOD. 
Separately specify a field's type
Key: LUCENE-2308
URL: https://issues.apache.org/jira/browse/LUCENE-2308
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Reporter: Michael McCandless
Assignee: Michael McCandless
Labels: gsoc2011, lucene-gsoc-11, mentor
Fix For: 4.0
Attachments: LUCENE-2308-10.patch, LUCENE-2308-11.patch, LUCENE-2308-12.patch, LUCENE-2308-13.patch, LUCENE-2308-14.patch, LUCENE-2308-15.patch, LUCENE-2308-16.patch, LUCENE-2308-17.patch, LUCENE-2308-18.patch, LUCENE-2308-19.patch, LUCENE-2308-2.patch, LUCENE-2308-20.patch, LUCENE-2308-21.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-branch.patch, LUCENE-2308-final.patch, LUCENE-2308-ltc.patch, LUCENE-2308-merge-1.patch, LUCENE-2308-merge-2.patch, LUCENE-2308-merge-3.patch, LUCENE-2308.branchdiffs, LUCENE-2308.branchdiffs.moved, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch, LUCENE-2308.patch
This came up from discussions on IRC. I'm summarizing here...
Today when you make a Field to add to a document you can set things like index or not, stored or not, analyzed or not, details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc.
I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value.
We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper).
This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index.
This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step.
We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
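Uwe Schindler's comment above describes two mechanics: the hidden ctor of the immutable class keeps a private clone of the builder as its state, and conventional many-parameter factory methods can delegate to the builder internally. A minimal sketch of that idea follows. All names here (SketchFieldType, Builder, create, the three flags) are hypothetical and chosen for illustration; this is not the code from any of the attached patches.

```java
// Hypothetical sketch: none of these names are Lucene's actual API.
final class SketchFieldType {

    // Mutable builder; each setter returns this so calls can be chained.
    static final class Builder implements Cloneable {
        private boolean indexed;
        private boolean stored;
        private boolean storeTermVectors;

        Builder indexed(boolean v) { indexed = v; return this; }
        Builder stored(boolean v) { stored = v; return this; }
        Builder storeTermVectors(boolean v) { storeTermVectors = v; return this; }

        @Override
        public Builder clone() {
            try {
                return (Builder) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: Builder is Cloneable
            }
        }

        SketchFieldType build() { return new SketchFieldType(this); }
    }

    // Private snapshot of the builder; it is never handed out, so the state cannot change.
    private final Builder state;

    // Hidden ctor: clone the builder so later changes to the caller's builder are not seen.
    private SketchFieldType(Builder b) { this.state = b.clone(); }

    // Conventional factory method with positional parameters that uses the builder internally.
    static SketchFieldType create(boolean indexed, boolean stored, boolean storeTermVectors) {
        return new Builder()
                .indexed(indexed)
                .stored(stored)
                .storeTermVectors(storeTermVectors)
                .build();
    }

    boolean indexed() { return state.indexed; }
    boolean stored() { return state.stored; }
    boolean storeTermVectors() { return state.storeTermVectors; }
}
```

With this shape, a caller who prefers positional parameters writes SketchFieldType.create(true, false, true) and never sees the builder, while the getters are implemented only once against the cloned state.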
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094527#comment-13094527 ]
Robert Muir commented on LUCENE-2308:
I care, that's why I am -1 to the builder pattern. The pro-builders on this issue just silently argue that my concerns don't matter. Mike gave his opinion on it too. Stating that our concerns are meaningless is not the way to create consensus towards a good solution here.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094529#comment-13094529 ]
Chris Male commented on LUCENE-2308:
{quote} The pro-builders on this issue just silently argue that my concerns don't matter. {quote}
I resent that. I've actively tried to understand your concerns and reach a compromise and consensus.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094531#comment-13094531 ]
Robert Muir commented on LUCENE-2308:
{quote} So yes, Lucene's core can stay lean and mean, but we can have the builder in userland / module / sandbox {quote}
Chris, personally I think this is a reasonable solution, but my arguments are instead against the other ridiculous statements on the issue implying that my concerns do not matter.
The original idea for using a simple java object was just this, so that people can do whatever they want (builders, whatever). But there is no reason to enforce any specific anti-pattern here, when we can just leave that to the application.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094532#comment-13094532 ]
Yonik Seeley commented on LUCENE-2308:
bq. I think we should have an oal.index.FieldType interface, that exposes (get-only) methods.
+1
I also don't see a lot of value in jumping through too many hoops trying to enforce immutability (vs just making it easy for people to avoid common mistakes).
[jira] [Assigned] (SOLR-2454) Would like link in site navigation to the ManifoldCF project
[ https://issues.apache.org/jira/browse/SOLR-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl reassigned SOLR-2454:
Assignee: Jan Høydahl
Would like link in site navigation to the ManifoldCF project
Key: SOLR-2454
URL: https://issues.apache.org/jira/browse/SOLR-2454
Project: Solr
Issue Type: Improvement
Components: documentation
Reporter: Karl Wright
Assignee: Jan Høydahl
Priority: Minor
Attachments: SOLR-2454.patch
The Solr/Lucene site points to lots of other Apache projects. It would be nice if it also pointed to ManifoldCF.
[jira] [Commented] (LUCENE-3130) Use BoostAttribute in TokenFilters to denote Terms that QueryParser should give lower boosts
[ https://issues.apache.org/jira/browse/LUCENE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094535#comment-13094535 ]
Robert Muir commented on LUCENE-3130:
{quote} But for the synonym case, what remains is to modify the QueryParser to act on the already-present TypeAttribute, is that so? If so, let's open another issue for that. {quote}
I think so? Though it might be more useful not to modify the core queryparser for this? The reason is that such a feature is geared towards synonyms, and multi-word synonyms don't work well with it... So maybe instead add it to a simpler queryparser that *does* work well with multi-word synonyms by default?
Use BoostAttribute in TokenFilters to denote Terms that QueryParser should give lower boosts
Key: LUCENE-3130
URL: https://issues.apache.org/jira/browse/LUCENE-3130
Project: Lucene - Java
Issue Type: Improvement
Reporter: Hoss Man
A recent thread asked if there was any way to use QueryTime synonyms such that matches on the original term specified by the user would score higher than matches on the synonym.
It occurred to me later that a float Attribute could be set by the SynonymFilter in such situations, and QueryParser could use that float as a boost in the resulting Query. (This would be fairly straightforward for the simple synonyms = BooleanQuery case, but we'd have to decide how to handle the case of synonyms with multiple terms that produce MTPQ, possibly just punt for now.)
Likewise, there may be other TokenFilters that inject artificial tokens at query time where it also might make sense to have a reduced boost factor...
* SynonymFilter
* CommonGramsFilter
* WordDelimiterFilter
* etc...
In all of these cases, the amount of the boost could be configured, and for back compat could default to 1.0 (or null to not set a boost at all).
Furthermore: if we add a new BoostAttrToPayloadAttrFilter that just copied the boost attribute into the payload attribute, these same filters could give penalizing payloads to terms when used at index time.
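The single-term case from the LUCENE-3130 description above can be sketched without any Lucene classes: a token that was injected as a synonym gets a configurable, lower boost, and the original term keeps 1.0. Everything below is a hypothetical stand-in (TypedToken, SynonymDeboost, and the "SYNONYM" type label are assumptions for this sketch), not the TokenFilter or QueryParser API.

```java
// Hypothetical stand-in for a token: a term plus a type label.
final class TypedToken {
    final String term;
    final String type;

    TypedToken(String term, String type) {
        this.term = term;
        this.type = type;
    }
}

final class SynonymDeboost {
    // Assumed label for tokens injected by a synonym filter.
    static final String SYNONYM_TYPE = "SYNONYM";

    // One boost per token: 1.0 for original terms, synonymBoost for injected synonyms.
    static float[] boosts(TypedToken[] tokens, float synonymBoost) {
        float[] result = new float[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            result[i] = SYNONYM_TYPE.equals(tokens[i].type) ? synonymBoost : 1.0f;
        }
        return result;
    }

    // Renders the single-term case as "term^boost" clauses separated by spaces.
    static String toQueryString(TypedToken[] tokens, float synonymBoost) {
        float[] b = boosts(tokens, synonymBoost);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(tokens[i].term).append('^').append(b[i]);
        }
        return sb.toString();
    }
}
```

Passing 1.0f as synonymBoost leaves every clause at the same weight, which corresponds to the back-compat default mentioned in the description. Multi-word synonyms are deliberately not handled here, matching the "punt for now" note.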
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094539#comment-13094539 ]
Chris Male commented on LUCENE-2308:
Alright, so can we move towards a consensus on a solution? So far I see people are okay with:
- Moving FieldType over to an interface which exposes get only methods
- Creating the core implementation which uses a ctor with final fields
- Builder API can be created and placed in a yet to be determined place.
Sweet?
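The three points in Chris Male's summary above could look roughly like the following. This is only a sketch under assumed names (FieldTypeView, SimpleFieldType, FieldTypeBuilder, and the three flags are hypothetical); the thread does not settle the real signatures or where the builder would live.

```java
// Hypothetical names throughout; a sketch of the proposal, not Lucene's API.

// Get-only view that indexing code would depend on.
interface FieldTypeView {
    boolean indexed();
    boolean stored();
    boolean tokenized();
}

// Core implementation: a single ctor, final fields, no setters.
final class SimpleFieldType implements FieldTypeView {
    private final boolean indexed;
    private final boolean stored;
    private final boolean tokenized;

    SimpleFieldType(boolean indexed, boolean stored, boolean tokenized) {
        this.indexed = indexed;
        this.stored = stored;
        this.tokenized = tokenized;
    }

    @Override public boolean indexed() { return indexed; }
    @Override public boolean stored() { return stored; }
    @Override public boolean tokenized() { return tokenized; }
}

// Optional builder that could live outside core; it only calls the ctor above.
final class FieldTypeBuilder {
    private boolean indexed;
    private boolean stored;
    private boolean tokenized;

    FieldTypeBuilder indexed() { indexed = true; return this; }
    FieldTypeBuilder stored() { stored = true; return this; }
    FieldTypeBuilder tokenized() { tokenized = true; return this; }

    FieldTypeView build() { return new SimpleFieldType(indexed, stored, tokenized); }
}
```

Because the builder depends only on the ctor, core code never needs to know it exists, which is the separation the summary asks for.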
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094541#comment-13094541 ]
Chris Male commented on LUCENE-2308:
Err Yonik pointed out that we still have the option of continuing to use the freezable 'soft' immutability. I didn't mean to ignore it.
[jira] [Resolved] (SOLR-2454) Would like link in site navigation to the ManifoldCF project
[ https://issues.apache.org/jira/browse/SOLR-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl resolved SOLR-2454.
Resolution: Fixed
Checked in on trunk, not yet deployed to live site
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094543#comment-13094543 ]
Uwe Schindler commented on LUCENE-2308:
I disagree, because that would again create different usage patterns and more questions for the user. If we only have one way to do it (I favour the builder pattern) with a code example (like NumericRangeQuery does in its javadocs), this is all obvious to users. I think telescopic ctors/methods are an antipattern because of readability, and I think Robert will also agree with me that e.g. WordDelimiterFilter is unusable.
[jira] [Updated] (LUCENE-3406) Add source packaging targets that make a tarball from a local working copy
[ https://issues.apache.org/jira/browse/LUCENE-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe updated LUCENE-3406:
Affects Version/s: 3.3
Fix Version/s: 3.4
Summary: Add source packaging targets that make a tarball from a local working copy (was: Add source distribution packaging targets that make a tarball from a local working copy)
Add source packaging targets that make a tarball from a local working copy
Key: LUCENE-3406
URL: https://issues.apache.org/jira/browse/LUCENE-3406
Project: Lucene - Java
Issue Type: Improvement
Components: general/build
Affects Versions: 3.3, 4.0
Reporter: Seung-Yeoul Yang
Assignee: Steven Rowe
Priority: Minor
Labels: patch
Fix For: 3.4, 4.0
Attachments: LUCENE-3406.patch, LUCENE-3406.patch
Original Estimate: 24h
Remaining Estimate: 24h
I am adding back targets that were removed in https://issues.apache.org/jira/browse/LUCENE-2973 that are used to create source distribution packaging from a local working copy as new Ant targets.
2 things to note about the patch:
1) For package-local-src-tgz in solr/build.xml, I had to specify additional directories under solr/ that have been added since LUCENE-2973.
2) I couldn't get the package-tgz-local-src in lucene/build.xml to generate the docs folder, which does get added by package-tgz-src.
The patch is against the trunk.
[jira] [Commented] (LUCENE-3130) Use BoostAttribute in TokenFilters to denote Terms that QueryParser should give lower boosts
[ https://issues.apache.org/jira/browse/LUCENE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094545#comment-13094545 ]
Jan Høydahl commented on LUCENE-3130:
The core of this issue is providing a mechanism for deboosting synonyms, and as long as it works with single-term synonyms that at least covers what most people use today. I propose we handle that first.
Agree that it would be nice with a query-parser which can handle multi word synonyms. But that could be handled incrementally in a separate issue.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094547#comment-13094547 ]
Robert Muir commented on LUCENE-2308:
But if fieldtype is an interface with get-only methods, then someone could make a Freezable implementation, right? Maybe the interface is good, because I dislike 'forcing' freezable too, just not as much as I dislike builder.
So, I think the interface sounds good, and would still personally prefer if our 'default' core implementation did not use freezable, and used the simpler ctor instead.
Also I think we should be gearing the API so that most people can use the simpler fieldtypes (StringField/TextField etc) for 90% of lucene uses instead: I think we want using FieldType directly to be more expert usage (e.g. I should be able to do a typical body+title+metadata fields with these StringField/TextField etc and never deal with this stuff).
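Robert Muir's point above is that a get-only interface leaves room for a freezable implementation supplied by anyone who wants one. A minimal sketch of such an implementation follows; the names ReadOnlyFieldType and FreezableFieldType and the two flags are hypothetical, and the 'soft' immutability shown is one possible reading of "freezable", not the code under review.

```java
// Hypothetical get-only interface; the real one is only proposed in this thread.
interface ReadOnlyFieldType {
    boolean indexed();
    boolean stored();
}

// 'Soft' immutability: setters work until freeze() is called, then they throw.
final class FreezableFieldType implements ReadOnlyFieldType {
    private boolean indexed;
    private boolean stored;
    private boolean frozen;

    void setIndexed(boolean v) { checkNotFrozen(); indexed = v; }
    void setStored(boolean v) { checkNotFrozen(); stored = v; }

    // After this call every setter throws IllegalStateException.
    void freeze() { frozen = true; }

    private void checkNotFrozen() {
        if (frozen) {
            throw new IllegalStateException("FieldType is frozen and can no longer be modified");
        }
    }

    @Override public boolean indexed() { return indexed; }
    @Override public boolean stored() { return stored; }
}
```

Code that consumes ReadOnlyFieldType cannot tell this apart from a ctor-based implementation with final fields, so the choice between the two stays with whoever constructs the type.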
[jira] [Assigned] (SOLR-2687) Add new Solr book 'Apache Solr 3.1 Cookbook' to selection of Solr books and news.
[ https://issues.apache.org/jira/browse/SOLR-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl reassigned SOLR-2687:

Assignee: Jan Høydahl

Add new Solr book 'Apache Solr 3.1 Cookbook' to selection of Solr books and news.

Key: SOLR-2687
URL: https://issues.apache.org/jira/browse/SOLR-2687
Project: Solr
Issue Type: Task
Reporter: Julian Copes
Assignee: Jan Høydahl
Attachments: solr-2687.patch, solr_31_cookbook.jpg

Find below the news of the new Solr book. I can provide an image when prompted. Below is a news item and I've included the URL for the new book. The text is as follows:

Rafał Kuć is proud to introduce a new book on Solr, Apache Solr 3.1 Cookbook from Packt Publishing. The Solr 3.1 Cookbook will make your everyday work easier by using real-life examples that show you how to deal with the most common problems that can arise while using the Apache Solr search engine.

This cookbook will show you how to get the most out of your search engine. Each chapter covers a different aspect of working with Solr from analyzing your text data through querying, performance improvement, and developing your own modules. The practical recipes will help you to quickly solve common problems with data analysis, show you how to use faceting to collect data and to speed up the performance of Solr. You will learn about functionalities that most newbies are unaware of, such as sorting results by a function value, highlighting matched words, and computing statistics to make your work with Solr easy and stress free.

Click here to read more about the Apache Solr 3.1 Cookbook. (http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book)
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094548#comment-13094548 ]

Chris Male commented on LUCENE-2308:

I don't see anything wrong with providing options.
[jira] [Updated] (LUCENE-3406) Add source packaging targets that make a tarball from a local working copy
[ https://issues.apache.org/jira/browse/LUCENE-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-3406:

Attachment: LUCENE-3406.patch

This version of the patch makes a couple of small changes to the Solr exclude pattern (adding {{\*\*/pom.xml}} and excluding {{\*\*/\*.iws}} and {{\*\*/\*.ipr}}; these two IntelliJ config files are not used by the setup provided by {{ant idea}}), and adds {{CHANGES.txt}} entries for Solr and Lucene.

I will commit shortly to trunk, then backport to branch_3x.

Add source packaging targets that make a tarball from a local working copy

Key: LUCENE-3406
URL: https://issues.apache.org/jira/browse/LUCENE-3406
Project: Lucene - Java
Issue Type: Improvement
Components: general/build
Affects Versions: 3.3, 4.0
Reporter: Seung-Yeoul Yang
Assignee: Steven Rowe
Priority: Minor
Labels: patch
Fix For: 3.4, 4.0
Attachments: LUCENE-3406.patch, LUCENE-3406.patch, LUCENE-3406.patch
Original Estimate: 24h
Remaining Estimate: 24h

I am adding back targets that were removed in https://issues.apache.org/jira/browse/LUCENE-2973 that are used to create source distribution packaging from a local working copy as new Ant targets.

2 things to note about the patch:
1) For package-local-src-tgz in solr/build.xml, I had to specify additional directories under solr/ that have been added since LUCENE-2973.
2) I couldn't get the package-tgz-local-src in lucene/build.xml to generate the docs folder, which does get added by package-tgz-src.

The patch is against the trunk.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094551#comment-13094551 ]

Chris Male commented on LUCENE-2308:

Okay so we seem to have consensus on moving to a get-only interface. The question just remains how to implement the 'default' core implementation.
[jira] [Commented] (LUCENE-3130) Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts
[ https://issues.apache.org/jira/browse/LUCENE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094554#comment-13094554 ]

Robert Muir commented on LUCENE-3130:

Jan, I'm not sure that most people only use single-term synonyms... if this is the case maybe we should rethink our synonyms implementation, because multi-word adds a ton of complexity!

Another reason I suggested avoiding adding this to the core queryparser is because it's going to be challenging to allow this optional boosting in a flexible way (just look at getFieldQuery... it's very hairy).

I think in the ideal case, we somehow restructure all this code so that subclasses have more control over how the query is created... however I think this might be challenging just given how the code is structured now.

The reason I think it would be best exposed as a 'hook' to subclasses (versus adding a deboost synonyms option directly to the core QP) is that I think people are going to want to customize how this works, e.g. control it per-field and things like that.

At the end of the day, a queryparser could always subclass getFieldQuery completely and do this now, but that's not great either because the code is so hairy :(

This kind of feature might be easier to implement with the new queryparser in contrib, but I'm not sure.

Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts

Key: LUCENE-3130
URL: https://issues.apache.org/jira/browse/LUCENE-3130
Project: Lucene - Java
Issue Type: Improvement
Reporter: Hoss Man

A recent thread asked if there was any way to use query-time synonyms such that matches on the original term specified by the user would score higher than matches on the synonym.

It occurred to me later that a float Attribute could be set by the SynonymFilter in such situations, and QueryParser could use that float as a boost in the resulting Query. (This would be fairly straightforward for the simple synonyms => BooleanQuery case, but we'd have to decide how to handle the case of synonyms with multiple terms that produce MTPQ, possibly just punt for now.)

Likewise, there may be other TokenFilters that inject artificial tokens at query time where it also might make sense to have a reduced boost factor...

* SynonymFilter
* CommonGramsFilter
* WordDelimiterFilter
* etc...

In all of these cases, the amount of the boost could be configured, and for back compat could default to 1.0 (or null to not set a boost at all).

Furthermore: if we add a new BoostAttrToPayloadAttrFilter that just copied the boost attribute into the payload attribute, these same filters could give penalizing payloads to terms when used at index time.
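The single-term case described in the issue can be illustrated without any Lucene classes. The sketch below is an assumption-laden toy: BoostedToken and SynonymBoostSketch are hypothetical names, the "filter" is just a method that tags injected synonyms with a lower boost, and the "query" is rendered as a string in the classic query parser's term^boost notation. It does not show how a real TokenFilter or QueryParser hook would be wired.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token carrying a per-term boost, standing in for a float attribute.
final class BoostedToken {
    final String term;
    final float boost;

    BoostedToken(String term, float boost) {
        this.term = term;
        this.boost = boost;
    }
}

final class SynonymBoostSketch {
    // The user's original term keeps boost 1.0; injected synonyms get synonymBoost.
    static List<BoostedToken> expand(String original, List<String> synonyms, float synonymBoost) {
        List<BoostedToken> out = new ArrayList<>();
        out.add(new BoostedToken(original, 1.0f));
        for (String synonym : synonyms) {
            out.add(new BoostedToken(synonym, synonymBoost));
        }
        return out;
    }

    // Renders the tokens as optional clauses, writing a boost only when it differs from 1.0.
    static String toQuery(List<BoostedToken> tokens) {
        StringBuilder sb = new StringBuilder();
        for (BoostedToken token : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token.term);
            if (token.boost != 1.0f) {
                sb.append('^').append(token.boost);
            }
        }
        return sb.toString();
    }
}
```

With a default synonym boost of 1.0 the rendered query is identical to the unboosted one, which is the backward-compatible behaviour the issue asks for.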
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094557#comment-13094557 ]

Uwe Schindler commented on LUCENE-2308:

Freezable is an antipattern and produces messy code on the implementation side, just because someone still stays in the 1990s, when Java was not able to handle lots of small short-lived objects. That has not been an issue since almost the beginning of this century.
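For readers unfamiliar with the pattern being criticized, here is a minimal sketch of a freezable type. FreezableFieldType and its methods are hypothetical names for illustration, not code from any of the attached patches. The point to notice is that every mutator has to repeat the frozen check.

```java
// Hypothetical sketch of the "freezable" approach: one class is both the
// mutable configuration object and the final, read-only instance.
final class FreezableFieldType {
    private boolean frozen;
    private boolean indexed;
    private boolean stored;

    // Every setter must call this first; forgetting it in one place silently
    // breaks the immutability guarantee.
    private void ensureNotFrozen() {
        if (frozen) {
            throw new IllegalStateException("FieldType is frozen and can no longer be modified");
        }
    }

    FreezableFieldType setIndexed(boolean value) {
        ensureNotFrozen();
        this.indexed = value;
        return this;
    }

    FreezableFieldType setStored(boolean value) {
        ensureNotFrozen();
        this.stored = value;
        return this;
    }

    // After this call the instance behaves as read-only.
    FreezableFieldType freeze() {
        this.frozen = true;
        return this;
    }

    boolean indexed() { return indexed; }
    boolean stored() { return stored; }
    boolean isFrozen() { return frozen; }
}
```

The fields cannot be declared final here, so immutability is enforced at runtime rather than by the compiler, which is the trade-off the comment above objects to.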
[jira] [Updated] (SOLR-2726) NullPointerException when using spellcheck.q
[ https://issues.apache.org/jira/browse/SOLR-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bernd Fehling updated SOLR-2726:

Attachment: SOLR-2726.patch

According to SOLR-572 the default analyzer should be WhitespaceAnalyzer. So I added the WhitespaceAnalyzer to init of Suggester.java. Now spellcheck.q works without NPE.

Tip: To get suggestions with multiple words like "New Y" for "New York" and also for "New Year" you can use queryAnalyzerFieldType with an analyzer having a PatternReplaceFilterFactory for e.g. "_" (underscore). If you now look up suggestions with "New_Y" you will get suggestions for "New York", "New Year", ...

NullPointerException when using spellcheck.q

Key: SOLR-2726
URL: https://issues.apache.org/jira/browse/SOLR-2726
Project: Solr
Issue Type: Bug
Components: spellchecker
Affects Versions: 3.3, 4.0
Environment: ubuntu
Reporter: valentin
Labels: nullpointerexception, spellcheck
Attachments: SOLR-2726.patch

When I use spellcheck.q in my query to define what will be spellchecked, I always have this error, for every configuration I try:

java.lang.NullPointerException
    at org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:476)
    at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:131)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:202)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

All my other functions work great, this is the only thing which doesn't work at all, just when I add spellcheck.q=my%20sentence in the query...

Example of a query:

    http://localhost:8983/solr/db/suggest_full?q=american%20israel&spellcheck.q=american%20israel

In solrconfig.xml:

    <searchComponent name="suggest_full" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">suggestTextFull</str>
      <lst name="spellchecker">
        <str name="name">suggest_full</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
        <str name="field">text_suggest_full</str>
        <str name="fieldType">suggestTextFull</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest_full" class="org.apache.solr.handler.component.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">suggest_full</str>
        <str name="spellcheck.count">10</str>
        <str name="spellcheck.onlyMorePopular">true</str>
      </lst>
      <arr name="components">
        <str>suggest_full</str>
      </arr>
    </requestHandler>

I'm using Solr 3.3, and I tried it too on Solr 4.0.
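The contents of SOLR-2726.patch are not shown in this message, so the following is only a sketch of the general idea the comment describes: fall back to whitespace tokenization when no analyzer has been configured, so that token extraction never dereferences null. SuggesterSketch and its methods are hypothetical and do not correspond to the real Suggester or SpellCheckComponent code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a spellchecker that needs an analyzer to tokenize spellcheck.q.
final class SuggesterSketch {
    private Function<String, List<String>> analyzer;

    // Simple whitespace tokenization, used as the default when nothing is configured.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Called once at setup time; a null argument means "no analyzer configured".
    void init(Function<String, List<String>> configuredAnalyzer) {
        this.analyzer = (configuredAnalyzer != null)
                ? configuredAnalyzer
                : SuggesterSketch::whitespaceTokens;
    }

    // Safe to call after init(): the analyzer is never null here.
    List<String> getTokens(String query) {
        return analyzer.apply(query);
    }
}
```

Choosing the default during initialization, rather than checking for null at every call site, keeps the query path free of special cases.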
[jira] [Issue Comment Edited] (SOLR-2726) NullPointerException when using spellcheck.q
[ https://issues.apache.org/jira/browse/SOLR-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094558#comment-13094558 ]

Bernd Fehling edited comment on SOLR-2726 at 8/31/11 2:18 PM:

According to SOLR-572 the default analyzer should be WhitespaceAnalyzer. With this SOLR-2726.patch I added the WhitespaceAnalyzer to init of Suggester.java. Now spellcheck.q works without NPE.

Tip: To get suggestions with multiple words like New Y for New York and also for New Year you can use queryAnalyzerFieldType with an analyzer having a PatternReplaceFilterFactory for e.g. _ (underscore). If you now lookup up suggestions with New_Y you will get suggestions for New York, New Year, ...

was (Author: befehl):

According to SOLR-572 the default analyzer should be WhitespaceAnalyzer. So I added the WhitespaceAnalyzer to init of Suggester.java. Now spellcheck.q works without NPE.

Tip: To get suggestions with multiple words like New Y for New York and also for New Year you can use queryAnalyzerFieldType with an analyzer having a PatternReplaceFilterFactory for e.g. _ (underscore). If you now lookup up suggestions with New_Y you will get suggestions for New York, New Year, ...
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094564#comment-13094564 ]

Uwe Schindler commented on LUCENE-2308:

In my opinion, we should vote for the following solutions:

[1] Old-style telescopic ctors on an immutable FieldType
[2] FieldType.Builder pattern with a hidden FieldType ctor and optionally static FieldType factory methods that produce commonly used types, maybe even telescopic (these factories use the builder internally / have a set of preconfigured builders available). The private ctor takes the Builder instance and clones it to keep state final (like IndexWriter)
[3] Modifiable FieldType with a freeze() method and inefficient code because of stupid checks on every method. This is somehow a builder; the only difference is that the builder and the final instance are the same class.
[4] Read-only interface with all three implementations

Here is my +1 for an easy-to-use Builder-only [2] implementation and nothing else. This has no additional object creation except the builder and an internal clone, but those are short-lived.
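Option [2] from the list above can be sketched as follows. This is an illustration under stated assumptions, not the code from any attached patch: BuiltFieldType, its Builder, the storedText() factory, and the rule that term vectors require an indexed field are all hypothetical choices made for the example.

```java
// Hypothetical sketch of option [2]: an immutable type whose only constructor
// is private and takes a builder, plus a static factory for a common setup.
final class BuiltFieldType {
    private final boolean indexed;
    private final boolean stored;
    private final boolean storeTermVectors;

    // Copies the builder's state, so later changes to the builder cannot
    // affect an instance that was already built.
    private BuiltFieldType(Builder builder) {
        this.indexed = builder.indexed;
        this.stored = builder.stored;
        this.storeTermVectors = builder.storeTermVectors;
    }

    static Builder builder() {
        return new Builder();
    }

    // Example of a preconfigured factory that uses the builder internally.
    static BuiltFieldType storedText() {
        return builder().indexed(true).stored(true).build();
    }

    boolean indexed() { return indexed; }
    boolean stored() { return stored; }
    boolean storeTermVectors() { return storeTermVectors; }

    static final class Builder {
        private boolean indexed;
        private boolean stored;
        private boolean storeTermVectors;

        Builder indexed(boolean value) { this.indexed = value; return this; }
        Builder stored(boolean value) { this.stored = value; return this; }
        Builder storeTermVectors(boolean value) { this.storeTermVectors = value; return this; }

        // Assumed consistency rule for this sketch: fail at build time instead
        // of silently ignoring term vectors on an unindexed field.
        BuiltFieldType build() {
            if (storeTermVectors && !indexed) {
                throw new IllegalStateException("term vectors require an indexed field");
            }
            return new BuiltFieldType(this);
        }
    }
}
```

All fields of the built type are final, so immutability is checked by the compiler, and the named builder calls avoid the long run of positional boolean arguments that a telescopic constructor would need.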
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094567#comment-13094567 ]

Robert Muir commented on LUCENE-2308:

{quote} Okay so we seem to have consensus on moving to a get-only interface. {quote}

I'm not sure: we should see what Uwe thinks. It seems he might be against the idea that there are multiple ways to do this (I think its a valid concern, i just disagree with him though).

I think the ideal situation is where StringField/TextField cover the majority of use cases and doing anything with FT is expert, e.g. intended for apps like Solr to implement the interface and probably not even use our 'default' FieldType implementation. I think the default impl is just for someone that wants something thats not out-of-box, e.g. tokenized TextField that omits TF.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094572#comment-13094572 ]

Robert Muir commented on LUCENE-2308:

well, looking at them, thats what they are now already? or am i totally confused?
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094570#comment-13094570 ]

Chris Male commented on LUCENE-2308:

{quote}
I think the ideal situation is where StringField/TextField cover the majority of use cases and doing anything with FT is expert, e.g. intended for apps like Solr to implement the interface and probably not even use our 'default' FieldType implementation.
{quote}

Separate from the implementation issue, I don't think I've fully grasped what you want StringField/TextField to be? Do you see them as having pre-defined FieldType setups?
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094575#comment-13094575 ]

Chris Male commented on LUCENE-2308:

Ah sorry, when I last looked at them some had constructors which accepted FieldTypes. Now those have been removed, so yes, that's what they are now.
[jira] [Resolved] (LUCENE-3406) Add source packaging targets that make a tarball from a local working copy
[ https://issues.apache.org/jira/browse/LUCENE-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved LUCENE-3406.
Resolution: Fixed

Committed to trunk and branch_3x. Thanks Seung-Yeoul!

Add source packaging targets that make a tarball from a local working copy
Key: LUCENE-3406
URL: https://issues.apache.org/jira/browse/LUCENE-3406
Project: Lucene - Java
Issue Type: Improvement
Components: general/build
Affects Versions: 3.3, 4.0
Reporter: Seung-Yeoul Yang
Assignee: Steven Rowe
Priority: Minor
Labels: patch
Fix For: 3.4, 4.0
Attachments: LUCENE-3406.patch, LUCENE-3406.patch, LUCENE-3406.patch
Original Estimate: 24h
Remaining Estimate: 24h

I am adding back targets that were removed in https://issues.apache.org/jira/browse/LUCENE-2973 that are used to create source distribution packaging from a local working copy as new Ant targets.

2 things to note about the patch:
1) For package-local-src-tgz in solr/build.xml, I had to specify additional directories under solr/ that have been added since LUCENE-2973.
2) I couldn't get the package-tgz-local-src in lucene/build.xml to generate the docs folder, which does get added by package-tgz-src.

The patch is against the trunk.
Re: new AutomatonQuery(RunAutomaton) ?
Thanks Robert, this is what I expected after looking into CompiledAutomaton ..

On Wed, Aug 31, 2011 at 2:00 PM, Robert Muir rcm...@gmail.com wrote:
On Wed, Aug 31, 2011 at 3:51 AM, eks dev eks...@yahoo.co.uk wrote:
At the moment it is not possible (?) to construct AutomatonQuery with RunAutomaton. Would it make sense to add this possibility? Is it doable at all?

It's not doable, we need more information than the runautomaton, it's not enough.

--
lucidimagination.com
Re: new AutomatonQuery(RunAutomaton) ?
Can you provide more information about your automaton and why 'recompiling' it might be expensive? E.g. #states/#transitions, is it finite or infinite, etc.

On Wed, Aug 31, 2011 at 10:56 AM, eks dev eks...@yahoo.co.uk wrote:
Thanks Robert, this is what I expected after looking into CompiledAutomaton ..

On Wed, Aug 31, 2011 at 2:00 PM, Robert Muir rcm...@gmail.com wrote:
On Wed, Aug 31, 2011 at 3:51 AM, eks dev eks...@yahoo.co.uk wrote:
At the moment it is not possible (?) to construct AutomatonQuery with RunAutomaton. Would it make sense to add this possibility? Is it doable at all?

It's not doable, we need more information than the runautomaton, it's not enough.

--
lucidimagination.com
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094606#comment-13094606 ]

Michael McCandless commented on LUCENE-2308:

bq. Builder patterns can be formatted very nice:

{noformat}
Field f = new Field.Builder()
  .setFoo()
  .setBar()
  .build();
{noformat}

This is nice in theory but in practice I often see massive compound hard-to-read lines like this:

{noformat}
IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMaxBufferedDocs(2).setMergePolicy(newLogMergePolicy()));
{noformat}

I don't like that the chained setters make such code possible: it's unreadable.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094615#comment-13094615 ]

Uwe Schindler commented on LUCENE-2308:

This is still more readable than

{code:java}
new IndexWriter(dir, TEST_VERSION_CURRENT, new MockAnalyzer(random), 2, newLogMergePolicy());
{code}

What does 2 mean? Yes, it's more verbose, but with any recent UI, the syntax highlighting makes even a one-line chain easily readable. Here are some quotes from Joshua Bloch (who is also the founder of Java) about his pattern: [http://www.goodreads.com/author/quotes/60805.Joshua_Bloch]
[jira] [Issue Comment Edited] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094615#comment-13094615 ]

Uwe Schindler edited comment on LUCENE-2308 at 8/31/11 3:21 PM:

This is still more readable than

{code:java}
new IndexWriter(dir, TEST_VERSION_CURRENT, new MockAnalyzer(random), 2, newLogMergePolicy());
{code}

What does 2 mean? Yes, it's more verbose, but with any recent UI, the syntax highlighting makes even a one-line chain easily readable. Here are some quotes from Joshua Bloch (who is also the founder of Java Collections framework) about his pattern: [http://www.goodreads.com/author/quotes/60805.Joshua_Bloch]

was (Author: thetaphi):
This is still more readable than

{code:java}
new IndexWriter(dir, TEST_VERSION_CURRENT, new MockAnalyzer(random), 2, newLogMergePolicy());
{code}

What does 2 mean? Yes, it's more verbose, but with any recent UI, the syntax highlighting makes even a one-line chain easily readable. Here are some quotes from Joshua Bloch (who is also the founder of Java) about his pattern: [http://www.goodreads.com/author/quotes/60805.Joshua_Bloch]
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094618#comment-13094618 ]

Uwe Schindler commented on LUCENE-2308:

bq. I don't like that the chained setters make such code possible: it's unreadable.

Even as a one-liner it's much more readable than anything else. Did you try to create a WordDelimiterFilter using its 15-argument ctor? Two minutes later you don't know anymore what the 3rd boolean is about. The chained calls can be read left to right and you can do that very fast. The syntax shown above is just extra sugar, but the one-line variant is perfectly readable. OK, not for people still using two whitespaces after the end-of-sentence-period ([http://en.wikipedia.org/wiki/Sentence_spacing]) :-)
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094629#comment-13094629 ]

Robert Muir commented on LUCENE-2308:

{quote}
Uwe, let's open an issue to look at improving WordDelimiterFilter, yeah? I've seen that floating around in tests: new WordDelimiter(1, 1, 0, 1, 1, 0, 0..), yeah it's tough to read.
{quote}

I agree we should open an issue to improve WDF. These int parameters are actually all boolean flags, and we could just pass 'int flags' instead. This way you could do new WordDelimiterFilter(GENERATE_WORD_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE), which is much more readable.
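The 'int flags' idea can be sketched roughly as follows. This is only an illustration: the class WdfFlags, the helper has(), and the bit values are assumptions made up for the example, not the actual WordDelimiterFilter API. Only the three constant names used in the comment come from the thread.

```java
// Hypothetical sketch: NOT the real WordDelimiterFilter API.
// Each boolean option becomes one bit in a single int.
final class WdfFlags {
    // Bit values are illustrative assumptions.
    static final int GENERATE_WORD_PARTS   = 1;
    static final int GENERATE_NUMBER_PARTS = 1 << 1;
    static final int CATENATE_WORDS        = 1 << 2;
    static final int CATENATE_NUMBERS      = 1 << 3;
    static final int CATENATE_ALL          = 1 << 4;
    static final int SPLIT_ON_CASE_CHANGE  = 1 << 5;

    private WdfFlags() {}

    /** Returns true if every bit of {@code flag} is set in {@code flags}. */
    static boolean has(int flags, int flag) {
        return (flags & flag) == flag;
    }
}

final class WdfFlagsDemo {
    public static void main(String[] args) {
        // The call site names each enabled option instead of passing 1s and 0s.
        int flags = WdfFlags.GENERATE_WORD_PARTS
                  | WdfFlags.CATENATE_ALL
                  | WdfFlags.SPLIT_ON_CASE_CHANGE;
        System.out.println(WdfFlags.has(flags, WdfFlags.CATENATE_ALL));
        System.out.println(WdfFlags.has(flags, WdfFlags.CATENATE_NUMBERS));
    }
}
```

Options that are not mentioned simply stay off, so a reader only has to look at the names that appear in the expression.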
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094632#comment-13094632 ]

Michael McCandless commented on LUCENE-2308:

bq. Even as a one-liner it's much more readable than anything else. Did you try to create a WordDelimiterFilter using its 15-argument ctor? Two minutes later you don't know anymore what the 3rd boolean is about.

In fact this is what I like about .freeze(): you invoke simple setters, one per line (usually), on one object. The only reason we want immutability here is to prevent the trap of changing the FT after binding it to a Field. And freeze accomplishes this well.

I agree a massive single ctor isn't great; but maybe w/ EnumSet or int flags for the boolean properties it's OK. Or maybe we go back to Field.Index.X, Field.Store.Y, etc. Or stick with .freeze.

A builder API can still be built out (e.g. in contrib or modules or google code or somewhere) on top; I just don't think it should be in Lucene's core. In general Lucene's core should keep things as straightforward as possible.
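For readers unfamiliar with the .freeze() approach described in this comment, a minimal sketch could look like the following. The class FreezableFieldType and its three properties are hypothetical names chosen for the example; they are not taken from the LUCENE-2308 patches.

```java
// Hypothetical sketch of a freezable type; the class and property names
// are assumptions for illustration, not the FieldType from the patches.
final class FreezableFieldType {
    private boolean indexed;
    private boolean stored;
    private boolean storeTermVectors;
    private boolean frozen;

    void setIndexed(boolean value) { checkNotFrozen(); indexed = value; }
    void setStored(boolean value) { checkNotFrozen(); stored = value; }
    void setStoreTermVectors(boolean value) { checkNotFrozen(); storeTermVectors = value; }

    boolean indexed() { return indexed; }
    boolean stored() { return stored; }
    boolean storeTermVectors() { return storeTermVectors; }

    /** After this call every setter throws IllegalStateException. */
    void freeze() { frozen = true; }

    private void checkNotFrozen() {
        if (frozen) {
            throw new IllegalStateException("this type is frozen and can no longer be changed");
        }
    }
}

final class FreezeDemo {
    public static void main(String[] args) {
        FreezableFieldType ft = new FreezableFieldType();
        ft.setIndexed(true);           // simple setters, one per line
        ft.setStoreTermVectors(true);
        ft.freeze();                   // e.g. when the type is bound to a field
        try {
            ft.setStored(true);        // any later change is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The trade-off compared with a builder is that misuse is caught at runtime rather than ruled out by construction, but there is only one class and no intermediate object.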
[jira] [Created] (LUCENE-3410) Make WordDelimiterFilter's instantiation more readable
Make WordDelimiterFilter's instantiation more readable
Key: LUCENE-3410
URL: https://issues.apache.org/jira/browse/LUCENE-3410
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Reporter: Chris Male
Priority: Minor

Currently WordDelimiterFilter's constructor is:

{code}
public WordDelimiterFilter(TokenStream in,
                           byte[] charTypeTable,
                           int generateWordParts,
                           int generateNumberParts,
                           int catenateWords,
                           int catenateNumbers,
                           int catenateAll,
                           int splitOnCaseChange,
                           int preserveOriginal,
                           int splitOnNumerics,
                           int stemEnglishPossessive,
                           CharArraySet protWords) {
{code}

which means its instantiation is an unreadable combination of 1s and 0s.

We should improve this by either using a Builder, 'int flags' or an EnumSet.
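Of the three options listed in the issue, the EnumSet variant might look roughly like this. The enum WdfOption and the holder class WdfConfig are hypothetical; the constant names simply mirror the constructor parameters above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: one enum constant per former int parameter.
enum WdfOption {
    GENERATE_WORD_PARTS, GENERATE_NUMBER_PARTS, CATENATE_WORDS,
    CATENATE_NUMBERS, CATENATE_ALL, SPLIT_ON_CASE_CHANGE,
    PRESERVE_ORIGINAL, SPLIT_ON_NUMERICS, STEM_ENGLISH_POSSESSIVE
}

// Hypothetical holder standing in for the filter's configuration.
final class WdfConfig {
    private final Set<WdfOption> options;

    WdfConfig(Set<WdfOption> options) {
        // Defensive copy so later changes to the caller's set have no effect.
        EnumSet<WdfOption> copy = EnumSet.noneOf(WdfOption.class);
        copy.addAll(options);
        this.options = copy;
    }

    boolean has(WdfOption option) {
        return options.contains(option);
    }
}

final class WdfConfigDemo {
    public static void main(String[] args) {
        WdfConfig config = new WdfConfig(
            EnumSet.of(WdfOption.GENERATE_WORD_PARTS, WdfOption.SPLIT_ON_CASE_CHANGE));
        System.out.println(config.has(WdfOption.SPLIT_ON_CASE_CHANGE));
        System.out.println(config.has(WdfOption.CATENATE_ALL));
    }
}
```

Compared with int flags this is type-safe (an unrelated int cannot be passed by mistake), at the cost of a set allocation per configuration.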
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094638#comment-13094638 ]

Chris Male commented on LUCENE-2308:

LUCENE-3410 for WDF improvements.
[jira] [Commented] (LUCENE-3410) Make WordDelimiterFilter's instantiation more readable
[ https://issues.apache.org/jira/browse/LUCENE-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094641#comment-13094641 ]
Robert Muir commented on LUCENE-3410:
I think flags is a good solution here, it's very simple and will improve readability: the backwards compat is obvious too.
I think it's a bit scary to use EnumSet, it will involve complicated generics and the JDK itself does not seem to use EnumSet anywhere! It uses int flags instead, e.g. Pattern.compile(String regex, int flags)
I think a builder is overkill here, if someone wants a builder they can always make a builder on top of flags for their own use.

Make WordDelimiterFilter's instantiation more readable
Key: LUCENE-3410
URL: https://issues.apache.org/jira/browse/LUCENE-3410
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Reporter: Chris Male
Priority: Minor

Currently WordDelimiterFilter's constructor is:
{code}
public WordDelimiterFilter(TokenStream in, byte[] charTypeTable, int generateWordParts, int generateNumberParts, int catenateWords, int catenateNumbers, int catenateAll, int splitOnCaseChange, int preserveOriginal, int splitOnNumerics, int stemEnglishPossessive, CharArraySet protWords) {
{code}
which means its instantiation is an unreadable combination of 1s and 0s. We should improve this by either using a Builder, 'int flags' or an EnumSet.
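To make the 'int flags' option from the comment above concrete, here is a minimal sketch of how nine int parameters could collapse into one bit set. The holder class `WordDelimiterFlags`, its constant names, and the `fromLegacy` helper are assumptions made up for illustration; they are not taken from any patch on this issue.

```java
// Hypothetical sketch: the class and constant names are illustrative only.
final class WordDelimiterFlags {
    // One bit per former int parameter, in the constructor's argument order.
    static final int GENERATE_WORD_PARTS     = 1 << 0;
    static final int GENERATE_NUMBER_PARTS   = 1 << 1;
    static final int CATENATE_WORDS          = 1 << 2;
    static final int CATENATE_NUMBERS        = 1 << 3;
    static final int CATENATE_ALL            = 1 << 4;
    static final int SPLIT_ON_CASE_CHANGE    = 1 << 5;
    static final int PRESERVE_ORIGINAL       = 1 << 6;
    static final int SPLIT_ON_NUMERICS       = 1 << 7;
    static final int STEM_ENGLISH_POSSESSIVE = 1 << 8;

    private WordDelimiterFlags() {}

    /** Returns true if every bit of {@code flag} is set in {@code flags}. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) == flag;
    }

    /**
     * Backwards-compat helper: maps the old 0/1 int arguments (given in the
     * constructor's order, non-zero meaning "on") to a single flags value.
     */
    static int fromLegacy(int... legacy) {
        if (legacy.length > 9) {
            throw new IllegalArgumentException("at most 9 legacy arguments");
        }
        int flags = 0;
        for (int i = 0; i < legacy.length; i++) {
            if (legacy[i] != 0) {
                flags |= 1 << i;
            }
        }
        return flags;
    }
}
```

A call site would then read `GENERATE_WORD_PARTS | CATENATE_WORDS` instead of a row of 1s and 0s, which is the readability gain the issue asks for.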
[jira] [Commented] (LUCENE-3410) Make WordDelimiterFilter's instantiation more readable
[ https://issues.apache.org/jira/browse/LUCENE-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094642#comment-13094642 ]
Yonik Seeley commented on LUCENE-3410:
For historical context, the reason I used an int for stuff like generateWordParts was that I had the idea of using it as a minimum (i.e. only generate word parts that are over a certain size, etc). This obviously never happened though ;-)
[jira] [Commented] (SOLR-2650) Empty docs array on response with grouping and result pagination
[ https://issues.apache.org/jira/browse/SOLR-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094643#comment-13094643 ]
Des Lownds commented on SOLR-2650:
I was able to duplicate this problem, and was also seeing the following stack trace in some circumstances:
{code}
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 35
  at org.apache.solr.search.DocSlice$1.nextDoc(DocSlice.java:117)
  at org.apache.solr.response.XMLWriter$3.writeDocs(XMLWriter.java:543)
  at org.apache.solr.response.XMLWriter.writeDocuments(XMLWriter.java:482)
  at org.apache.solr.response.XMLWriter.writeDocList(XMLWriter.java:519)
  at org.apache.solr.response.XMLWriter.writeVal(XMLWriter.java:582)
  at org.apache.solr.response.XMLWriter.writeNamedList(XMLWriter.java:620)
  at org.apache.solr.response.XMLWriter.writeVal(XMLWriter.java:593)
  at org.apache.solr.response.XMLWriter.writeNamedList(XMLWriter.java:620)
  at org.apache.solr.response.XMLWriter.writeVal(XMLWriter.java:593)
  at org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:131)
  at org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:35)
  at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:343)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:589)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:291)
  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
  at java.lang.Thread.run(Thread.java:662)
{code}

Empty docs array on response with grouping and result pagination
Key: SOLR-2650
URL: https://issues.apache.org/jira/browse/SOLR-2650
Project: Solr
Issue Type: Bug
Components: search
Affects Versions: 3.3
Reporter: Massimo Schiavon

Requesting a certain number of rows and setting start parameter to a greater value returns 0 results with grouping enabled.
For example, requesting:
http://localhost:8080/solr/web/select/?q=*:*&rows=1&start=2
(grouping and highlighting are enabled by default)
I get this response:
[...]
response: {
  numFound: 117852
  start: 2
  docs: [ ]
}
highlighting: {
  0938630598: {
    title: [ ... ]
    content: [ ... ]
  }
}
[...]
docs array is empty while the highlighted values of the document are present
Debugging the request in org.apache.solr.search.Grouping.Command.createSimpleResponse() at row 534
[...]
int len = Math.min(numGroups, docsGathered);
if (offset > len) {
  len = 0;
}
[...]
The initial vars values are:
numGroups = 1
docsGathered = 3
offset = 2
so after the execution len = 0
I've tried commenting the if statement and this resolves the issue but could introduce some other bugs.
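The arithmetic in the report above can be replayed in isolation. The sketch below copies only the two quoted statements into a hypothetical helper (`quotedLen` and the class name are invented for this example) and assumes the comparison operator in the quoted snippet is `>`; it does not reproduce the rest of `createSimpleResponse()` and does not propose a fix.

```java
// Hypothetical stand-alone replay of the two statements quoted in the report.
final class GroupingLenReplay {
    private GroupingLenReplay() {}

    /** Assumes the quoted condition is "offset > len". */
    static int quotedLen(int numGroups, int docsGathered, int offset) {
        int len = Math.min(numGroups, docsGathered);
        if (offset > len) {
            len = 0;
        }
        return len;
    }

    public static void main(String[] args) {
        // Values from the report: numGroups = 1, docsGathered = 3, offset = 2.
        // min(1, 3) is 1, and 2 > 1, so len is reset to 0.
        System.out.println(quotedLen(1, 3, 2)); // prints 0
        // With start=0 the same request keeps one document.
        System.out.println(quotedLen(1, 3, 0)); // prints 1
    }
}
```

With the reported values the offset is compared against a length that was already capped by `numGroups`, which is consistent with the empty `docs` array even though three documents were gathered.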
[jira] [Commented] (SOLR-2650) Empty docs array on response with grouping and result pagination
[ https://issues.apache.org/jira/browse/SOLR-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094647#comment-13094647 ]
Des Lownds commented on SOLR-2650:
It seems that using group.format=simple results in the ArrayIndexOutOfBoundsException, while using the standard format returns wrong results (no results).
[jira] [Commented] (SOLR-2650) Empty docs array on response with grouping and result pagination
[ https://issues.apache.org/jira/browse/SOLR-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094651#comment-13094651 ]
Yonik Seeley commented on SOLR-2650:
There are a few grouping+paging bugs fixed in 3x (which will be 3.4 when released). Can you try a recent 3x nightly build and see if any of the problems remain?
[jira] [Commented] (LUCENE-3410) Make WordDelimiterFilter's instantiation more readable
[ https://issues.apache.org/jira/browse/LUCENE-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094653#comment-13094653 ]
Uwe Schindler commented on LUCENE-3410:
OK, if those integers are always used only as boolean flags, I would prefer a single (int flags) parameter. No builder pattern needed. I would maybe prefer a long to make it more extensible (but 31 flags should be enough, too).
Re: more granular updateRequestProcessorChain
On Wed, Aug 31, 2011 at 7:52 AM, Jan Høydahl jan@cominvent.com wrote:
> Hi,
> Can you explain the wanted functional result of your copy operation? I've done copying fields in processors without trouble. What do you want to do with the Lucene Document?

Indeed - I've started going in the opposite direction and removed the lucene Document from the AddUpdateCommand altogether (see SOLR-2700).

-Yonik
http://www.lucidimagination.com
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094656#comment-13094656 ]
Uwe Schindler commented on LUCENE-2308:
bq. The only reason we want immutability here is to prevent the trap of changing the FT after binding it to a Field. And freeze accomplishes this well.
What is the difference from a builder? You can also call builder methods one per line if you like. I like builders especially for their readability: you can read the line and break it at any place, just like a sentence. This is why the method names should (ideally) look like sentence components and not like setXXX().
Freeze is an antipattern, as you use one object for changing and then for freezing, leading to if-checks everywhere. If you make freeze() return a new immutable object, it is a builder, just without the possibility to chain.
I dislike freeze, but if you want to do this, please add chaining; it costs you nothing as the implementor.
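The builder-versus-freeze argument in the comment above may be easier to follow with a small sketch. Everything here is hypothetical: `FieldTypeSketch`, its three properties and the builder method names are stand-ins chosen for illustration, not the FieldType API from any of the attached patches.

```java
// Hypothetical sketch of an immutable value object with a chaining builder.
final class FieldTypeSketch {
    private final boolean indexed;
    private final boolean stored;
    private final boolean omitNorms;

    // Only the builder can create instances; all fields are final, so no
    // "is this frozen yet?" checks are needed anywhere.
    private FieldTypeSketch(Builder b) {
        this.indexed = b.indexed;
        this.stored = b.stored;
        this.omitNorms = b.omitNorms;
    }

    boolean indexed()   { return indexed; }
    boolean stored()    { return stored; }
    boolean omitNorms() { return omitNorms; }

    static Builder builder() {
        return new Builder();
    }

    /** Mutable while configuring; method names read like sentence parts. */
    static final class Builder {
        private boolean indexed;
        private boolean stored;
        private boolean omitNorms;

        Builder indexed()      { indexed = true;   return this; }
        Builder stored()       { stored = true;    return this; }
        Builder omittingNorms() { omitNorms = true; return this; }

        /** Returns a new immutable object; the builder itself stays reusable. */
        FieldTypeSketch build() {
            return new FieldTypeSketch(this);
        }
    }
}
```

Usage would look like `FieldTypeSketch.builder().indexed().stored().build()`. Compared with a freezable object, the mutable and immutable phases live in two separate types, so the setters need no state checks.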
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094662#comment-13094662 ]
Yonik Seeley commented on LUCENE-2308:
Since it seems like there is no agreement on enforcing immutability, perhaps we shouldn't. We don't do it in a lot of other places, for example all of our query classes (and I don't think we should start).
Rethinking the interface a bit... even that seems like a little overkill (and perhaps just a by-product of no one agreeing on the concrete implementation?) After all, if this is to just be a holder for parameters (like indexed, stored, etc) then allowing one to subclass doesn't add any power or even make much sense (they aren't going to change the behavior of anything, right?) The other normal use cases for interfaces wouldn't seem to apply to this situation either.
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094664#comment-13094664 ]
Robert Muir commented on LUCENE-2308:
{quote}
I agree massive single ctor isn't great; but maybe w/ EnumSet or int flags for the boolean properties it's OK.
{quote}
Maybe FieldType should really just be an 'int' (e.g. we don't have a class or anything)?
[jira] [Commented] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094665#comment-13094665 ]
Uwe Schindler commented on LUCENE-2308:
bq. However whether or not people agree with you on Builders and chain calls, at this stage there just isn't the support to make Builders mandatory. Yes we should create one and I'll look to you for help on that. But as a first step forward lets move FieldType over to being a get-only interface. That will leave the freezable API in there and we can then consider the next step forward. But again, I don't really see consensus on the Builder-only approach. Rather I see a lot of support for having a single ctor implementation and a builder using that.
I would like to have an on-list vote, please. Thanks.
Re: new AutomatonQuery(RunAutomaton) ?
I do not think it will be expensive, it is just an attempt to keep code smaller, simpler and marginally faster :)

Those are a lot (ca. 1000) of small prefix-based regexes with a limited alphabet, compiled as RunAutomaton, which I load on startup and look up from some RunAutomaton[] on request... they look like
Regex(((123)|(124)|(401)|(777)|(351))[0-9]{0,2})

By the way, what will AutomatonQuery prefer: (XXX)[0-9]{0,2} or (XXX)[0-9]* or (XXX).* ? Any performance difference? Semantically they are the same, as I know that my content is only 5 digits.

I need them to
1. formulate a complex BooleanQuery, where AutomatonQuery gets one clause
2. do post processing (a lot of hits) of the query against hits, and this has to be fast.

I guess I will switch to keeping only Automaton[] and build the RunAutomaton on the fly (per request) for the fast query vs hits check; this is done once per request only, but then I need to keep the state of the RunAutomaton per query... makes things slightly more verbose...

On Wed, Aug 31, 2011 at 5:06 PM, Robert Muir rcm...@gmail.com wrote:
> Can you provide more information about your automaton and why 'recompiling' it might be expensive? E.g. #states/#transitions, is it finite or infinite, etc.
>
> On Wed, Aug 31, 2011 at 10:56 AM, eks dev eks...@yahoo.co.uk wrote:
>> Thanks Robert, this is what I expected after looking into CompiledAutomaton ..
>>
>> On Wed, Aug 31, 2011 at 2:00 PM, Robert Muir rcm...@gmail.com wrote:
>>> On Wed, Aug 31, 2011 at 3:51 AM, eks dev eks...@yahoo.co.uk wrote:
>>>> At the moment it is not possible (?) to construct AutomatonQuery with RunAutomaton. Would it make sense to add this possibility? Is it doable at all?
>>>
>>> It's not doable, we need more information than the RunAutomaton, it's not enough.
>>>
>>> --
>>> lucidimagination.com
Re: new AutomatonQuery(RunAutomaton) ?
On Wed, Aug 31, 2011 at 1:30 PM, eks dev eks...@googlemail.com wrote:
> I do not think it will be expensive, it is just an attempt to keep code smaller, simpler and marginally faster :)

I think you will find the compile is pretty fast, and it only happens once per query too (it's not per-segment or anything)... see below.

> those are a lot (Ca 1000) of small prefix based regex-es with limited alphabet compiled as RunAutomaton I load on startup and lookup from some RunAutomaton[] on request... they look like Regex(((123)|(124)|(401)|(777)|(351))[0-9]{0,2})
> By the way, what will AutomatonQuery prefer (XXX)[0-9]{0,2} or (XXX)[0-9]* or (XXX).* ? Any performance difference?

Well, you would have to benchmark, and it definitely depends on your content.

(XXX)[0-9]{0,2} is the 'simplest' automaton in that it's finite; if you actually have (XXX)[0-9][0-9]junk it will seek past that.

The other two forms you listed are infinite, and when AutomatonQuery finds a 'loop' in the automaton, it temporarily turns itself into a 'filtering range query', with the upper bound being the end of the transition. This prevents it from doing a lot of useless disk seeks.

If you have (XXX)[0-9]*, this is going to seek to (XXX) and then act as a range query to (XXX)a (exclusive, just indicating a is the first valid term after the infinitely long pattern (XXX)9..). Then for each term in the range query it's going to 'check' that it matches the automaton.

(XXX).* will be similar to the above, except it's obviously going to accept more terms, e.g. (XXX)m, and its 'range query' will be something like (XXX)-(XXY).

> Semantically are they the same as I know that my content is only 5 digits
> I need them to 1. formulate complex BooleanQuery, where AutomatonQuery gets one clause 2. do post processing (a lot of hits) of the query against hits and this has to be fast.
> I guess, I will switch to keeping only Automaton[] and build RunAutomaton on the fly (per request) for fast query vs hits, this is done once per request only, but then I need to keep state of the RunAutomaton per query... makes things slightly more verbose...

AutomatonQuery computes this stuff a single time, up-front in its constructor. Can you just reuse the AutomatonQuery(s) in your app? This should work fine.

--
lucidimagination.com
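The 'seek, then act as a filtering range query' behaviour described in the reply above can be imitated with plain JDK collections. The sketch below is not Lucene code: a sorted `TreeSet` stands in for the term dictionary, `java.util.regex.Pattern` stands in for the automaton, and the upper bound is simplified to "terms still starting with the prefix" rather than being derived from automaton transitions. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.regex.Pattern;

// Hypothetical illustration of seek + bounded scan + per-term check.
final class SeekThenFilterSketch {
    private SeekThenFilterSketch() {}

    /**
     * Seeks to the first term >= prefix, scans forward only while terms still
     * start with the prefix (the simplified "range"), and keeps the terms
     * that the pattern fully matches.
     */
    static List<String> matchWithinPrefix(NavigableSet<String> sortedTerms,
                                          String prefix,
                                          Pattern pattern) {
        List<String> accepted = new ArrayList<>();
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // past the upper bound: nothing later can match
            }
            if (pattern.matcher(term).matches()) {
                accepted.add(term); // term passed the per-term check
            }
        }
        return accepted;
    }
}
```

For sorted terms `12299, 123, 12345, 1234x, 123m, 124`, the pattern `123[0-9]*` visits the four terms starting with `123` and accepts `123` and `12345`, while `123.*` accepts all four, which mirrors the remark that `(XXX).*` accepts more terms over a similar range.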
Re: new AutomatonQuery(RunAutomaton) ?
On Wed, Aug 31, 2011 at 1:44 PM, Robert Muir rcm...@gmail.com wrote:
>> By the way, what will AutomatonQuery prefer (XXX)[0-9]{0,2} or (XXX)[0-9]* or (XXX).* ? Any performance difference?

Also, what I said only applies to old term dictionary implementations... if you are using the latest trunk with BlockTree (https://issues.apache.org/jira/browse/LUCENE-3030), the rules don't apply: the automaton is intersected in a totally different way.

--
lucidimagination.com
[jira] [Resolved] (SOLR-2687) Add new Solr book 'Apache Solr 3.1 Cookbook' to selection of Solr books and news.
[ https://issues.apache.org/jira/browse/SOLR-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl resolved SOLR-2687.
Resolution: Fixed
Ok, this is committed, thanks for the contrib!

Add new Solr book 'Apache Solr 3.1 Cookbook' to selection of Solr books and news.
Key: SOLR-2687
URL: https://issues.apache.org/jira/browse/SOLR-2687
Project: Solr
Issue Type: Task
Reporter: Julian Copes
Assignee: Jan Høydahl
Attachments: solr-2687.patch, solr_31_cookbook.jpg

Find below the news of the new Solr book. I can provide an image when prompted. Below is a news item and I've included the URL for the new book. The text is as follows:
Rafał Kuć is proud to introduce a new book on Solr, Apache Solr 3.1 Cookbook from Packt Publishing. The Solr 3.1 Cookbook will make your everyday work easier by using real-life examples that show you how to deal with the most common problems that can arise while using the Apache Solr search engine. This cookbook will show you how to get the most out of your search engine. Each chapter covers a different aspect of working with Solr from analyzing your text data through querying, performance improvement, and developing your own modules. The practical recipes will help you to quickly solve common problems with data analysis, show you how to use faceting to collect data and to speed up the performance of Solr. You will learn about functionalities that most newbies are unaware of, such as sorting results by a function value, highlighting matched words, and computing statistics to make your work with Solr easy and stress free.
Click here to read more about the Apache Solr 3.1 Cookbook. (http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book)
Web site updated
Hi, After committing SOLR-2454 and SOLR-2687 I have now updated the Solr site (from trunk). It should propagate soon. If there are any problems, the old /www/lucene.apache.org/solr is backed up as solr.old.janhoy :) -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com
Re: new AutomatonQuery(RunAutomaton) ?
Bytes are good; I am in byte range on this data, and even simpler is good :) It is simple: I just need to know whether the automaton I used for the AutomatonQuery accepts one stored field. So yes, it is the same information as in Term, but I need to run over it once more because my query does not filter on the AutomatonQuery: ((AutomatonQuery(A)) OR (OtherQuery) )+ So I get back documents not matched by this automaton, and I do not know which ones are there due to the OtherQuery. Running the search in 2 passes, with and without the automaton, is not practicable. On Wed, Aug 31, 2011 at 8:45 PM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 31, 2011 at 2:37 PM, eks dev eks...@yahoo.co.uk wrote: Keeping AutomatonQuery around came to me as an option, but do not forget, I need Automaton (RunAutomaton) for post processing... There is no way to get Automaton back from the AutomatonQuery? The compiled automaton is not always a RunAutomaton; sometimes its internal representation is something even simpler :) Additionally, when it is a RunAutomaton, it's a UTF-8 one, for operating directly on bytes... Can you describe a little bit about what 'post processing' you need to do? I imagine it's post processing on something other than the terms? -- lucidimagination.com
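The post-processing step described in this message (finding out which hits of a disjunctive query were really accepted by the automaton) can be sketched without any Lucene types. This is only an illustration under stated assumptions: `PostCheckSketch` and `acceptedHits` are hypothetical names, the map of document id to stored field value stands in for whatever the search returned, and the `Predicate<String>` stands in for the automaton. In Lucene the predicate could presumably be backed by a run automaton built from the same Automaton that was given to the AutomatonQuery.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: re-check the stored field of every hit against the
// same acceptor that was used to build the query.
final class PostCheckSketch {

    // Returns the ids of the hits whose stored value is accepted.
    // Hits that are not returned matched only through the other clause.
    static List<Integer> acceptedHits(Map<Integer, String> storedValues,
                                      Predicate<String> acceptor) {
        List<Integer> accepted = new ArrayList<>();
        for (Map.Entry<Integer, String> hit : storedValues.entrySet()) {
            if (acceptor.test(hit.getValue())) {
                accepted.add(hit.getKey());
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Made-up hits: document id -> stored field value.
        Map<Integer, String> hits = new LinkedHashMap<>();
        hits.put(1, "abc");
        hits.put(2, "xyz");
        hits.put(3, "abd");
        // Stand-in acceptor for the automaton (assumption: "starts with ab").
        System.out.println(acceptedHits(hits, s -> s.startsWith("ab")));
    }
}
```

This avoids a second search pass, at the cost of loading the stored field for every hit.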
[jira] [Updated] (SOLR-2650) Empty docs array on response with grouping and result pagination
[ https://issues.apache.org/jira/browse/SOLR-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Des Lownds updated SOLR-2650: - Attachment: grouping_patch.txt patch file Empty docs array on response with grouping and result pagination Key: SOLR-2650 URL: https://issues.apache.org/jira/browse/SOLR-2650 Project: Solr Issue Type: Bug Components: search Affects Versions: 3.3 Reporter: Massimo Schiavon Attachments: grouping_patch.txt Requesting a certain number of rows and setting the start parameter to a greater value returns 0 results with grouping enabled. For example, requesting: http://localhost:8080/solr/web/select/?q=*:*&rows=1&start=2 (grouping and highlighting are enabled by default) I get this response: [...] response: { numFound: 117852 start: 2 docs: [ ] } highlighting: { 0938630598: { title: [ ... ] content: [ ... ] } } [...] The docs array is empty while the highlighted values of the document are present. Debugging the request in org.apache.solr.search.Grouping.Command.createSimpleResponse() at row 534 [...] int len = Math.min(numGroups, docsGathered); if (offset > len) { len = 0; } [...] The initial variable values are: numGroups = 1, docsGathered = 3, offset = 2, so after the execution len = 0. I've tried commenting out the if statement and this resolves the issue, but it could introduce some other bugs.
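The arithmetic in the report can be replayed in isolation. The following is a minimal sketch, not the Solr source: `GroupingLenSketch` and `reportedLen` are made-up names, and the greater-than comparison is assumed from the behaviour the reporter describes.

```java
// Hypothetical stand-alone replay of the length computation quoted in the report.
// This is not the actual org.apache.solr.search.Grouping code.
final class GroupingLenSketch {

    // Mirrors the quoted snippet: cap the length at the smaller of the two
    // counts, then zero it when the offset exceeds that cap
    // (the comparison operator is an assumption).
    static int reportedLen(int numGroups, int docsGathered, int offset) {
        int len = Math.min(numGroups, docsGathered);
        if (offset > len) {
            len = 0;
        }
        return len;
    }

    public static void main(String[] args) {
        // Values from the report: numGroups = 1, docsGathered = 3, offset = 2.
        System.out.println(reportedLen(1, 3, 2)); // prints 0
    }
}
```

With the reported values the cap is 1, so any offset above 1 empties the slice even though three documents were gathered, which matches the empty docs array in the response.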
[jira] [Commented] (SOLR-2650) Empty docs array on response with grouping and result pagination
[ https://issues.apache.org/jira/browse/SOLR-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094896#comment-13094896 ] Des Lownds commented on SOLR-2650: -- I'd be happy to test a nightly; where do I download them from? Or is it an svn co? Empty docs array on response with grouping and result pagination Key: SOLR-2650 URL: https://issues.apache.org/jira/browse/SOLR-2650 Project: Solr Issue Type: Bug Components: search Affects Versions: 3.3 Reporter: Massimo Schiavon Attachments: grouping_patch.txt Requesting a certain number of rows and setting the start parameter to a greater value returns 0 results with grouping enabled. For example, requesting: http://localhost:8080/solr/web/select/?q=*:*&rows=1&start=2 (grouping and highlighting are enabled by default) I get this response: [...] response: { numFound: 117852 start: 2 docs: [ ] } highlighting: { 0938630598: { title: [ ... ] content: [ ... ] } } [...] The docs array is empty while the highlighted values of the document are present. Debugging the request in org.apache.solr.search.Grouping.Command.createSimpleResponse() at row 534 [...] int len = Math.min(numGroups, docsGathered); if (offset > len) { len = 0; } [...] The initial variable values are: numGroups = 1, docsGathered = 3, offset = 2, so after the execution len = 0. I've tried commenting out the if statement and this resolves the issue, but it could introduce some other bugs.
[jira] [Commented] (LUCENE-2906) Filter to process output of ICUTokenizer and create overlapping bigrams for CJK
[ https://issues.apache.org/jira/browse/LUCENE-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094902#comment-13094902 ] Tom Burton-West commented on LUCENE-2906: - Any chance this might get implemented for 3.4? Filter to process output of ICUTokenizer and create overlapping bigrams for CJK Key: LUCENE-2906 URL: https://issues.apache.org/jira/browse/LUCENE-2906 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Tom Burton-West Priority: Minor Fix For: 3.4, 4.0 Attachments: LUCENE-2906.patch The ICUTokenizer produces unigrams for CJK. We would like to use the ICUTokenizer but have overlapping bigrams created for CJK as in the CJK Analyzer. This filter would take the output of the ICUTokenizer, read the ScriptAttribute and, for selected scripts (Han, Kana), would produce overlapping bigrams.
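To make "overlapping bigrams" concrete, here is a minimal sketch of the transformation on a run of single-character tokens. It is not the proposed Lucene filter and uses no Lucene API: `OverlappingBigramSketch` and `bigrams` are hypothetical names, the script check (Han, Kana) is left out, and passing a lone unigram through unchanged is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of overlapping bigrams over a run of unigrams.
final class OverlappingBigramSketch {

    // Joins each token with its successor, so n unigrams yield n - 1 bigrams.
    // A run of exactly one token is returned as is (assumption of this sketch),
    // because no bigram can be formed from it.
    static List<String> bigrams(List<String> unigrams) {
        List<String> out = new ArrayList<>();
        if (unigrams.size() == 1) {
            out.add(unigrams.get(0));
            return out;
        }
        for (int i = 0; i + 1 < unigrams.size(); i++) {
            out.add(unigrams.get(i) + unigrams.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three Han characters written as Unicode escapes to keep the source ASCII.
        List<String> run = Arrays.asList("\u6771", "\u4eac", "\u90fd");
        // Two bigrams result: the first+second and the second+third character.
        System.out.println(bigrams(run).size());
    }
}
```

Each character (except the first and last of a run) ends up in two tokens, which is what lets a two-character query term match regardless of where it sits in the run.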
[jira] [Updated] (LUCENE-3167) Make lucene/solr a OSGI bundle through Ant
[ https://issues.apache.org/jira/browse/LUCENE-3167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Stancapiano updated LUCENE-3167: - Attachment: LUCENE-3167.patch Hi Steven, I am sending a new updated patch. I added two new stamp properties to the build-manifest macro (start.touch.time and end.touch.time) that log the milliseconds taken by the process. War files in OSGi are handled like jar files. If the OSGi repository has functionality to work with containers, it takes the information directly from the bundle. The MANIFEST.MF file doesn't include information about containers. I added the bnd library from http://dl.dropbox.com/u/2590603/bnd/biz.aQute.bndlib.jar (currently the dropbox holds the only version for Ant; see http://www.aqute.biz/Bnd/Download) and added it to the Ant classpath, as is done for the 'generate-maven-artifacts' target. Here are the responses to the tasks: 1 - Checked the box to grant the Apache license. 2 - Renamed the patch according to the convention. 3 - Deleted the bnd configuration for Solr. Now only the build-manifest macro declared in the common-build.xml of the lucene project is used. But I was forced to declare the attributes @{title} and @{implementation.title} as properties inside the build-manifest macro, otherwise they were not visible in the external file lucene.bnd. 4 - I see the correct value of ${bnd.project.description} because the property is created through the configuration: <xmlproperty file="${ant.file}" collapseAttributes="true" prefix="bnd"/> inside the build-manifest macro. Maybe I didn't add everything in the previous patch. Let me know if the problem persists. 5 - I excluded the DSTAMP, TSTAMP, and TODAY properties from the bnd configuration through the property: -removeheaders . The main problem is that the bnd Ant task takes all the Ant properties starting with an uppercase letter and adds them without asking. There should be a bnd property -inherit (true/false) that tells whether to import the Ant properties, but it doesn't work. This problem is reported in: https://github.com/bnd/bnd/issues/72. Another important thing is that the 'Name' Ant property declared in some build.xml files is not accepted by the bnd Ant task. In the bnd Ant task code there is a hard error if the 'Name' property is found: if (header.equalsIgnoreCase("Name")) { error("Your bnd file contains a header called 'Name'. This interferes with the manifest name section."); continue; } So I was forced to rename the 'Name' property and its references to 'LuceneName'. 6 - Added the ${user.name} property to the Implementation-Version manifest property. 7 - Renamed the Bundle-DocUR property to Bundle-DocURL. Make lucene/solr a OSGI bundle through Ant -- Key: LUCENE-3167 URL: https://issues.apache.org/jira/browse/LUCENE-3167 Project: Lucene - Java Issue Type: New Feature Environment: bndtools Reporter: Luca Stancapiano Attachments: LUCENE-3167.patch, lucene_trunk.patch, lucene_trunk.patch We need to make a bundle through Ant, so the binary can be published without requiring a download of the sources. Currently, to get an OSGi bundle we need to use Maven tools and build the sources. Here is the reference for the creation of the OSGi bundle through Maven: https://issues.apache.org/jira/browse/LUCENE-1344 Bndtools could be used inside Ant
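The property handling described in points 5 and the 'Name' remark can be modelled in a few lines. This is a sketch of the behaviour as the comment describes it, not bnd's implementation: `HeaderFilterSketch` and `manifestHeaders` are hypothetical names, the removal list is matched by exact name only, and 'Name' is silently skipped here where the quoted bnd code reports an error.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of which Ant properties end up as manifest headers.
final class HeaderFilterSketch {

    static Map<String, String> manifestHeaders(Map<String, String> antProperties,
                                               Set<String> removeHeaders) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (Map.Entry<String, String> p : antProperties.entrySet()) {
            String key = p.getKey();
            // Only properties starting with an uppercase letter are picked up.
            if (key.isEmpty() || !Character.isUpperCase(key.charAt(0))) {
                continue;
            }
            // 'Name' clashes with the manifest name section, so it is rejected.
            if (key.equalsIgnoreCase("Name")) {
                continue;
            }
            // Names listed for removal (e.g. DSTAMP, TSTAMP, TODAY) are dropped.
            if (removeHeaders.contains(key)) {
                continue;
            }
            headers.put(key, p.getValue());
        }
        return headers;
    }

    public static void main(String[] args) {
        // Made-up property values, for illustration only.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("DSTAMP", "20110831");
        props.put("Name", "Lucene");
        props.put("user.name", "luca");
        props.put("Bundle-DocURL", "http://lucene.apache.org/");
        Set<String> remove = new HashSet<>(Arrays.asList("DSTAMP", "TSTAMP", "TODAY"));
        System.out.println(manifestHeaders(props, remove).keySet()); // prints [Bundle-DocURL]
    }
}
```

Renaming 'Name' to 'LuceneName', as done in the patch, would let that property pass the second check instead of being rejected.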
[jira] [Commented] (LUCENE-3167) Make lucene/solr a OSGI bundle through Ant
[ https://issues.apache.org/jira/browse/LUCENE-3167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094931#comment-13094931 ] Steven Rowe commented on LUCENE-3167: - Hi Luca, I'll take a look at your new patch today or tomorrow. Have you done any timings yet? I don't understand a couple of things you wrote: bq. War files in OSGI are worked as the jar files. Do you mean that OSGi treats .war files the same as .jar files? bq. If the OSGI repository has functionalities to work with containers, it takes the informations directly by the bundle. The MANIFEST.MF file doesn't include informations about containers. What is a container? What is a bundle? Why does it matter that MANIFEST.MF does not include information about containers? How are these things related to the other topics under discussion on this issue? (I wasn't kidding when I wrote that I know nothing about OSGi.) Make lucene/solr a OSGI bundle through Ant -- Key: LUCENE-3167 URL: https://issues.apache.org/jira/browse/LUCENE-3167 Project: Lucene - Java Issue Type: New Feature Environment: bndtools Reporter: Luca Stancapiano Attachments: LUCENE-3167.patch, lucene_trunk.patch, lucene_trunk.patch We need to make a bundle through Ant, so the binary can be published without requiring a download of the sources. Currently, to get an OSGi bundle we need to use Maven tools and build the sources. Here is the reference for the creation of the OSGi bundle through Maven: https://issues.apache.org/jira/browse/LUCENE-1344 Bndtools could be used inside Ant
Re: new AutomatonQuery(RunAutomaton) ?
Can you clone the AutomatonQuery and combine it with a filter returning a single document? There is code that does something like this in LUCENE-3318. That way you can test whether the automaton matches a document without the need to tease it apart. -Mike On 08/31/2011 04:32 PM, eks dev wrote: Bytes are good; I am in byte range on this data, and even simpler is good :) It is simple: I just need to know whether the automaton I used for the AutomatonQuery accepts one stored field. So yes, it is the same information as in Term, but I need to run over it once more because my query does not filter on the AutomatonQuery: ((AutomatonQuery(A)) OR (OtherQuery) )+ So I get back documents not matched by this automaton, and I do not know which ones are there due to the OtherQuery. Running the search in 2 passes, with and without the automaton, is not practicable. On Wed, Aug 31, 2011 at 8:45 PM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 31, 2011 at 2:37 PM, eks dev eks...@yahoo.co.uk wrote: Keeping AutomatonQuery around came to me as an option, but do not forget, I need Automaton (RunAutomaton) for post processing... There is no way to get Automaton back from the AutomatonQuery? The compiled automaton is not always a RunAutomaton; sometimes its internal representation is something even simpler :) Additionally, when it is a RunAutomaton, it's a UTF-8 one, for operating directly on bytes... Can you describe a little bit about what 'post processing' you need to do? I imagine it's post processing on something other than the terms? -- lucidimagination.com