[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719351#action_12719351 ]
Mark Miller commented on LUCENE-1595:
-------------------------------------

bq. If we say that contrib in general does not need to maintain back-compat, and we're talking about classes that are in production environments, then I don't think we have a real issue here.

Contrib does not necessarily have a back-compat guarantee, but it's up to each contrib to determine what its back-compat policy is. Even without an explicit policy, we try to do what makes sense. Every time you ignore back compat completely for certain things, you risk alienating certain people. For example, because the highlighter is a semi-core type of thing, even though we have never made a back-compat policy for it, we don't break back compat there without good reason.

I agree that the Benchmark contrib comes down on the low end of concern. In fact, I'm not too concerned with breaking back compat anywhere in benchmark except for the algs. Every time we break the algs, we risk causing people who have written custom algs to think twice about writing and maintaining them. Generally, I'd expect things like that to be careful about maintaining back compat, or to offer a mode that can run older-version algs.

That said, I'm not saying this change isn't worth a little algorithm disruption. I wouldn't mind getting the opinion of another committer first, though. I doubt too many people even have that many custom algs out there, but that's not a scenario I want to help perpetuate. So it's not like I'm sitting here saying this is a huge deal, but I think it should definitely be considered a bit.
> Split DocMaker into ContentSource and DocMaker
> ----------------------------------------------
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Reporter: Shai Erera
> Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring to the benchmark package. Today,
> DocMaker has two roles: collecting documents from a collection and preparing
> a Document object. These two should actually be split up into ContentSource and
> DocMaker, which will use a ContentSource instance.
>
> ContentSource will implement all the content-reading methods of DocMaker, like
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/
> 1591, by having a basic ContentSource that offers input stream services, and
> wraps a file (for example) with a bzip or gzip stream etc.
>
> DocMaker will implement the makeDocument methods, reusing DocState etc.
>
> The idea is that collecting the Enwiki documents, for example, should be the
> same whether I create documents using DocState, add payloads or index
> additional metadata. Same goes for Trec and Reuters collections, as well as
> LineDocMaker.
>
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are
> 99% the same and 99% different. Most of their differences lie in the way they
> read the data, while most of the similarity lies in the way they create
> documents (using DocState).
>
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker
> (just for the reuse of DocState). Also, other DocMakers do not use that DocState
> today, something they could have gotten for free with the refactoring
> proposed here.
>
> So by having an EnwikiContentSource, ReutersContentSource and others (TREC,
> Line, Simple), I can write several DocMakers, such as DocStateMaker,
> ConfigurableDocMaker (one which accepts all kinds of config options) and
> custom DocMakers (payload, facets, sorting), passing to them a ContentSource
> instance, and reuse the same DocMaking algorithm with many content sources, as
> well as the same ContentSource algorithm with many DocMaker implementations.
>
> This will also give us the opportunity to perf test content sources alone
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of
> creating a Document object.
>
> I've already done so in my code environment (I extend the benchmark package
> for my application's purposes) and I like the flexibility I have. I think
> this can be a nice contribution to the benchmark package, which can result in
> some code cleanup as well.
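
For readers skimming the thread, here is a minimal sketch of the split the issue describes. It is not the code in the attached patches. Only the names ContentSource, DocMaker, getNextDocData and makeDocument come from the issue text; everything else (the two-field DocData, LineContentSource, getBytesCount, the tab-separated line format, and a Map standing in for a Lucene Document) is a simplified assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ContentSourceSketch {

    /** Hypothetical raw record; the real benchmark DocData carries more fields. */
    static final class DocData {
        final String title;
        final String body;

        DocData(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    /** Role 1: read raw content from a collection and track how much was read. */
    interface ContentSource {
        /** Returns the next record, or null when the source is exhausted (assumed contract). */
        DocData getNextDocData();

        /** Hypothetical accessor for the raw bytes consumed so far. */
        long getBytesCount();
    }

    /** Assumed line format: "title<TAB>body", one record per line. */
    static final class LineContentSource implements ContentSource {
        private final Iterator<String> lines;
        private long bytes;

        LineContentSource(List<String> lines) {
            this.lines = lines.iterator();
        }

        @Override
        public DocData getNextDocData() {
            if (!lines.hasNext()) {
                return null;
            }
            String line = lines.next();
            bytes += line.getBytes(StandardCharsets.UTF_8).length;
            String[] parts = line.split("\t", 2);
            return new DocData(parts[0], parts.length > 1 ? parts[1] : "");
        }

        @Override
        public long getBytesCount() {
            return bytes;
        }
    }

    /** Role 2: turn records into documents, independent of where they came from. */
    static class DocMaker {
        private final ContentSource source;

        DocMaker(ContentSource source) {
            this.source = source;
        }

        /** A Map stands in for a Lucene Document here; returns null when the source is done. */
        Map<String, String> makeDocument() {
            DocData data = source.getNextDocData();
            if (data == null) {
                return null;
            }
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("title", data.title);
            doc.put("body", data.body);
            return doc;
        }
    }

    /** Exercises a content source alone, without building any documents. */
    static int drain(ContentSource source) {
        int count = 0;
        while (source.getNextDocData() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("A\tfirst body", "B\tsecond body");
        DocMaker maker = new DocMaker(new LineContentSource(lines));
        for (Map<String, String> doc = maker.makeDocument(); doc != null; doc = maker.makeDocument()) {
            System.out.println(doc);
        }
        System.out.println("records read without a DocMaker: " + drain(new LineContentSource(lines)));
    }
}
```

Under these assumptions, any DocMaker variant can be paired with any ContentSource, and `drain` shows how a source could be timed on its own, which is the perf-testing point made in the description.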