A new document, yes. I should watch my terminology more closely.

Scott

On Wed, Sep 1, 2010 at 11:53 PM, Lance Norskog <[email protected]> wrote:

> Do you mean a new Solr/Lucene index, or a new document with only the
> snippet?
>
> On Wed, Sep 1, 2010 at 5:29 PM, Scott Gonyea <[email protected]> wrote:
>
> > Hi,
> >
> > I'm looking to get some direction on where I should focus my
> > attention, with regards to the Solr codebase and documentation.
> > Rather than write a ton of stuff no one wants to read, I'll just
> > start with a use-case. For context, the data originates from Nutch
> > crawls and is indexed into Solr.
> >
> > Imagine a web page has the following content (4 occurrences of
> > "Johnson" are bolded):
> >
> > --content_--
> > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit.
> > Aenean id urna et justo fringilla dictum johnson in at tortor. Nulla
> > eu nulla magna, nec sodales est. Sed johnSon sed elit non lorem
> > sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
> > malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla
> > ullamcorper sem.
> > --_content--
> >
> > First; I would like to have the entire "content" block be indexed
> > within Solr. This is done and definitely not an issue.
> >
> > Second (+); during the injection of crawl data into Solr, I would
> > like to grab every occurrence of a specific word, or phrase, with
> > "Johnson" being my example for the above. I want to take every such
> > phrase (without collision), as well as its unique-context, and
> > inject that into its own, separate Solr index. For example, the
> > above "content" example, having been indexed in its entirety, would
> > also be the source of 4 additional indexes. In each index, "Johnson"
> > would only appear once. All of the text before and after "Johnson"
> > would be BOUND BY any other occurrence of "Johnson." e.g.:
> >
> > --index1_--
> > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit.
> > Aenean id urna et justo fringilla dictum
> > --_index1--
> >
> > --index2_--
> > sit amet, consectetur adipiscing elit. Aenean id urna et justo
> > fringilla dictum johnson in at tortor. Nulla eu nulla magna, nec
> > sodales est. Sed
> > --_index2--
> >
> > --index3_--
> > in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed
> > elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis
> > rhoncus vel malesuada
> > --_index3--
> >
> > --index4_--
> > sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis
> > rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut
> > fringilla ullamcorper sem.
> > --_index4--
> >
> > Q:
> > How much of this is feasible in "present-day Solr" and how much of
> > it do I need to produce in a patch of my own? Can anyone give me
> > some direction on where I should look, in approaching this problem
> > (i.e., libs / classes / confs)? I sincerely appreciate it.
> >
> > Third; I would later like to go through the above child indexes and
> > dismiss any that appear within a given context. For example, I may
> > deem "ipsum dolor Johnson sit amet" as not being useful and I'd want
> > to delete any indexes matching that particular phrase-context. The
> > deletion is trivial and, with the 2nd item resolved--this becomes a
> > fairly non-issue.
> >
> > Q:
> > The question, more or less, comes from the fact that my source data
> > is from a web crawler. When recrawled, I need to repeat the process
> > of dismissing phrase-contexts that are not relevant to me. Where is
> > the best place to perform this work? I could easily perform queries,
> > after indexing my crawl, but that seems needlessly intensive. I
> > think the answer to that will be "wherever I implement #2", but
> > assumptions can be painfully expensive.
> >
> > Thank you for reading my bloated e-mail. Again, I'm mostly just
> > looking to be pointed to various pieces of the Lucene / Solr
> > code-base, and am trolling for any insight that people might share.
> >
> > Scott Gonyea
>
> --
> Lance Norskog
> [email protected]
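
For reference, the splitting described in the second item of the quoted
message (one snippet per occurrence, each bounded by the neighbouring
occurrences) can be sketched in a few lines. This is only an illustration
of the string handling, not anything that exists in Solr or Nutch: the
function name context_snippets is hypothetical, and it assumes plain
case-insensitive substring matching, so "Johnsons" counts as a match just
as it does in the index4 example.

```python
import re


def context_snippets(text, term):
    """Return one snippet per occurrence of `term` in `text`.

    Each snippet runs from the end of the previous occurrence (or the
    start of the text) to the start of the next occurrence (or the end
    of the text), so the term appears exactly once in every snippet.
    Matching is case-insensitive and substring-based (an assumption).
    """
    matches = list(re.finditer(re.escape(term), text, flags=re.IGNORECASE))
    snippets = []
    for i, match in enumerate(matches):
        # Left bound: just after the previous occurrence, if there is one.
        start = matches[i - 1].end() if i > 0 else 0
        # Right bound: just before the next occurrence, if there is one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        snippets.append(text[start:end].strip())
    return snippets
```

Each returned snippet would then be sent to Solr as its own document
alongside the full page. Where that hooks into the Nutch-to-Solr indexing
step is exactly the open question in the thread, so it is left out here.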
