Do you mean a new Solr/Lucene index, or a new document with only the snippet?
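Either way, the splitting step described in the quoted message does not depend on Solr and could run before the documents are posted. Here is a minimal sketch, in Python rather than Solr's Java, under two assumptions of mine: matching is case-insensitive, and a word that merely starts with the term (such as "Johnsons") counts as an occurrence, as it does in the example. The name bounded_contexts is hypothetical and is not a Solr or Lucene API.

```python
import re

# Sample page content taken from the quoted message.
CONTENT = (
    "Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id "
    "urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna, "
    "nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a "
    "arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula "
    "nisi. Ut fringilla ullamcorper sem."
)


def bounded_contexts(text, term):
    """Return one snippet per occurrence of `term`.

    Each snippet runs from the end of the previous occurrence (or the start
    of the text) to the start of the next occurrence (or the end of the
    text), so the term appears exactly once in every snippet.
    """
    # Assumption: case-insensitive, and trailing word characters are part
    # of the occurrence, so "Johnsons" matches the term "johnson".
    pattern = re.compile(r"\b" + re.escape(term) + r"\w*", re.IGNORECASE)
    matches = list(pattern.finditer(text))

    contexts = []
    for i, match in enumerate(matches):
        start = matches[i - 1].end() if i > 0 else 0
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        contexts.append(text[start:end].strip())
    return contexts


if __name__ == "__main__":
    for number, snippet in enumerate(bounded_contexts(CONTENT, "johnson"), 1):
        print(f"--index{number}_--")
        print(snippet)
        print(f"--_index{number}--")
```

Each snippet could then be sent to Solr as its own document, for instance with a field pointing back at the parent page. That layout is only a suggestion, and which one fits depends on the answer to the question above.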

On Wed, Sep 1, 2010 at 5:29 PM, Scott Gonyea <[email protected]> wrote:
> Hi,
>
> I'm looking to get some direction on where I should focus my attention, with
> regards to the Solr codebase and documentation. Rather than write a ton of
> stuff no one wants to read, I'll just start with a use-case. For context,
> the data originates from Nutch crawls and is indexed into Solr.
>
> Imagine a web page has the following content (4 occurrences of "Johnson" are
> bolded):
>
> --content_--
> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id
> urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna,
> nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a
> arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula
> nisi. Ut fringilla ullamcorper sem.
> --_content--
>
> First; I would like to have the entire "content" block be indexed within
> Solr. This is done and definitely not an issue.
>
> Second (+); during the injection of crawl data into Solr, I would like to
> grab every occurrence of a specific word, or phrase, with "Johnson" being my
> example for the above. I want to take every such phrase (without
> collision), as well as its unique-context, and inject that into its own,
> separate Solr index. For example, the above "content" example, having been
> indexed in its entirety, would also be the source of 4 additional indexes.
> In each index, "Johnson" would only appear once. All of the text before and
> after "Johnson" would be BOUND BY any other occurrence of "Johnson." eg:
>
> --index1_--
> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id
> urna et justo fringilla dictum
> --_index1--
> --index2_--
> sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla
> dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
> --_index2--
> --index3_--
> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit
> non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
> malesuada
> --_index3--
> --index4_--
> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus
> vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper
> sem.
> --_index4--
>
> Q:
> How much of this is feasible in "present-day Solr" and how much of it do I
> need to produce in a patch of my own? Can anyone give me some direction on
> where I should look, in approaching this problem (ie, libs / classes /
> confs)? I sincerely appreciate it.
>
> Third; I would later like to go through the above, child indexes and dismiss
> any that appear within a given context. For example, I may deem "ipsum
> dolor Johnson sit amet" as not being useful and I'd want to delete any
> indexes matching that particular phrase-context. The deletion is trivial
> and, with the 2nd item resolved--this becomes a fairly non-issue.
>
> Q:
> The question, more or less, comes from the fact that my source data is from
> a web crawler. When recrawled, I need to repeat the process of dismissing
> phrase-contexts that are not relevant to me. Where is the best place to
> perform this work? I could easily perform queries, after indexing my crawl,
> but that seems needlessly intensive. I think the answer to that will be
> "wherever I implement #2", but assumptions can be painfully expensive.
>
>
> Thank you for reading my bloated e-mail. Again, I'm mostly just looking to
> be pointed to various pieces of the Lucene / Solr code-base, and am trolling
> for any insight that people might share.
>
> Scott Gonyea
>

--
Lance Norskog
[email protected]
