[ https://issues.apache.org/jira/browse/LUCENE-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511264#comment-13511264 ]
Shai Erera commented on LUCENE-4590:
------------------------------------
bq. That's what I planned at start, but decided to leave WriteLineDoc intact
because it is general, that is, not aware of the unique structure of Wikipedia
data, where some of the pages represent categories.
I think you misunderstood me, or I wasn't clear enough. WriteLineDoc would
not change; EnwikiContentSource would. Someone who wants a line file over all
Wikipedia pages would put something like
{{content.source=EnwikiContentSource}} and
{{enwiki.source.exclude.categories=false}} in the .alg file; to leave the
category pages out, they would set {{enwiki.source.exclude.categories=true}}
instead. WriteLineDocTask would still write the DocData that the source
returns.
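To make the proposed switch concrete, here is a minimal sketch of how a content
source could honor it. Everything in it is an assumption for illustration: the
class {{ExcludeCategoriesSketch}} and the method {{select}} are hypothetical,
the property does not exist in the benchmark module today, the default of
{{false}} is a guess, and recognizing category pages by a "Category:" title
prefix is a simplification. It is not the real EnwikiContentSource.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical sketch of the proposed flag; not the real EnwikiContentSource. */
public class ExcludeCategoriesSketch {
    // Property name as proposed in this comment; not an existing benchmark setting.
    static final String EXCLUDE_PROP = "enwiki.source.exclude.categories";

    /** Assumption: a category page is recognized by its "Category:" title prefix. */
    static boolean isCategory(String title) {
        return title.startsWith("Category:");
    }

    /** Returns the titles the source would emit under the given configuration. */
    static List<String> select(List<String> titles, Properties config) {
        // Assumed default: keep category pages unless the flag is set to true.
        boolean exclude = Boolean.parseBoolean(config.getProperty(EXCLUDE_PROP, "false"));
        List<String> emitted = new ArrayList<>();
        for (String title : titles) {
            if (exclude && isCategory(title)) {
                continue; // skip category pages when exclusion is requested
            }
            emitted.add(title);
        }
        return emitted;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(EXCLUDE_PROP, "true");
        List<String> titles = Arrays.asList("Apache Lucene", "Category:Search engines");
        System.out.println(select(titles, config));
    }
}
```

With the flag set to {{true}}, only the non-category title is emitted; with
{{false}}, both pass through to whatever task consumes the source.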
EnwikiContentSource will return either DocData or CategoryDocData, or a single
object EnwikiDocData with an extra boolean isCategory. WriteLineDoc will still
read just the DocData fields it knows about. WriteEnwikiLineDoc will write the
DocData to the relevant file, per isCategory.
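The single-object variant could look roughly like the sketch below. It is only
an illustration under stated assumptions: {{CategorySplitSketch}} and its
nested {{EnwikiDocData}} are hypothetical stand-ins (not the benchmark
module's DocData), and the tab-separated title/body line is a simplified
layout chosen for the example.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical sketch of routing docs to two line files by isCategory. */
public class CategorySplitSketch {

    /** Stand-in for the proposed EnwikiDocData: doc fields plus a category flag. */
    static final class EnwikiDocData {
        final String title;
        final String body;
        final boolean isCategory;

        EnwikiDocData(String title, String body, boolean isCategory) {
            this.title = title;
            this.body = body;
            this.isCategory = isCategory;
        }
    }

    /** Formats one doc as a single line; the field layout is an assumption. */
    static String toLine(EnwikiDocData doc) {
        return doc.title + "\t" + doc.body + "\n";
    }

    /** Writes the doc to the category file or the regular file, per isCategory. */
    static void write(EnwikiDocData doc, PrintWriter pages, PrintWriter categories) {
        PrintWriter target = doc.isCategory ? categories : pages;
        target.print(toLine(doc));
    }

    public static void main(String[] args) {
        StringWriter pageFile = new StringWriter();
        StringWriter categoryFile = new StringWriter();
        PrintWriter pages = new PrintWriter(pageFile);
        PrintWriter categories = new PrintWriter(categoryFile);

        write(new EnwikiDocData("Apache Lucene", "text", false), pages, categories);
        write(new EnwikiDocData("Category:Search engines", "list", true), pages, categories);
        pages.flush();
        categories.flush();

        System.out.print("pages: " + pageFile);
        System.out.print("categories: " + categoryFile);
    }
}
```

A writer that only knows the plain doc fields can ignore the flag entirely,
which is why the general line-file task would not need to change.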
bq. Actually I am after the two files
I know :). I'm not proposing anything different, just discussing how the code
could be designed to achieve that and, as a bonus, let someone exclude the
category pages from regular benchmarks.
> WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file
> ---------------------------------------------------------------------------
>
> Key: LUCENE-4590
> URL: https://issues.apache.org/jira/browse/LUCENE-4590
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/benchmark
> Reporter: Doron Cohen
> Assignee: Doron Cohen
> Priority: Minor
>
> It may be convenient to split Wikipedia's line file into two separate files:
> category-pages and non-category ones.
> It is possible to split the original line file with grep or such.
> It is more efficient to do it in advance.