Re: Field mapping for RSS feed

Karl Wright Wed, 03 Aug 2011 00:03:26 -0700

All fixes for the ticket are complete.
To use them, you will need to build and run trunk rather than the
0.2-incubating release.  Let me know if this is a problem.


Thanks!
Karl

On Tue, Aug 2, 2011 at 3:04 PM, Karl Wright <daddy...@gmail.com> wrote:
> Hi Kate,
>
> Many news RSS feeds put the full article in either the item
> description or the item content field, while the document described by
> the url field is not just straight content but contains navigation and
> advertising "chrome".  In such cases it's often preferable to generate
> an index based on the description or content field contents rather
> than the actual document with all of that chrome.  The Dechromed
> Content options allow you to set up that behavior for a specific job.
>
> Thanks for opening the ticket; I'll propose a solution shortly.
>
> Karl
>
>
> On Tue, Aug 2, 2011 at 2:56 PM, K McGonigal <kmcgon...@gmail.com> wrote:
>> Hi Karl,
>>
>> Thank you for your quick response. I've opened a Jira ticket for this,
>> though I don't really understand what sort of solution you had in mind so I
>> didn't propose anything.
>>
>> I'm afraid I don't understand exactly what the Dechromed Content options do
>> either. I read about them in the End User Documentation, but there wasn't
>> much there yet.
>>
>> I find it odd that I would be the first person to have this problem. You'd
>> think it would be very common.
>>
>>
>> Kate
>>
>>
>> On Tue, Aug 2, 2011 at 11:05 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>> I just looked at the code.  It's less a bug than an oversight of
>>> sorts.  The "description" or "content" fields are indexed as the
>>> primary content of the document if the "chrome" mode is selected
>>> accordingly.  If "None" is the "chrome" mode, then the item-level
>>> description field is ignored even when present.
>>>
>>> So I recommend simply adding a new kind of "description" field for
>>> when the "chrome" mode is set to "None".  "item/description" may be
>>> its name, or maybe the full XPath, your choice.  Propose something in
>>> the ticket and I'll respond.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>> On Tue, Aug 2, 2011 at 11:47 AM, Karl Wright <daddy...@gmail.com> wrote:
>>> > Hi Kate,
>>> >
>>> > The field mapping won't do the trick because the RSS connector is
>>> > currently very selective about what fields it extracts - it by no
>>> > means extracts all of them, so the ones that it *does* extract from
>>> > the feed are "special".
>>> >
>>> > The behavior you describe sounds like a bug to me.  I'll go spelunking
>>> > through the code at first opportunity.  In the meantime, could you
>>> > create a Jira ticket describing the behavior you see vs. the behavior
>>> > you want?
>>> >
>>> > Thanks!
>>> > Karl
>>> >
>>> > On Tue, Aug 2, 2011 at 11:41 AM, K McGonigal <kmcgon...@gmail.com> wrote:
>>> >> Hi,
>>> >>
>>> >> I'm trying to use ManifoldCF to index an RSS feed into Solr.  It sort of
>>> >> works, but my main problem at the moment is that the *channel* description
>>> >> from the RSS feed is written to the "description" field in Solr when I would
>>> >> really like the *item* description to be written instead.
>>> >>
>>> >> I have a typical RSS feed with the general structure:
>>> >>
>>> >> <rss>
>>> >>     <channel>
>>> >>         <title></title>
>>> >>         <link></link>
>>> >>         <description> *** the description I don't want *** </description>
>>> >>         <item>
>>> >>             <title></title>
>>> >>             <link></link>
>>> >>             <pubDate></pubDate>
>>> >>             <description> *** the description I do want *** </description>
>>> >>             <author></author>
>>> >>             <category></category>
>>> >>         </item>
>>> >>     </channel>
>>> >> </rss>
>>> >>
>>> >> I tried setting up the field mapping on the job with the XPath address of
>>> >> the second description, i.e. "/rss/channel/item/description" as the source,
>>> >> but that did not work.
>>> >>
>>> >> I suspect I'm overlooking something simple, but I've spent 2 days trying to
>>> >> solve it.  I would be grateful for any help.
>>> >>
>>> >>
>>> >> Kate McGonigal
>>> >>
>>> >>
>>> >>
>>> >
>>
>>
>
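
For readers following the thread, the channel-level versus item-level
description distinction Kate describes can be sketched in Python with the
standard library's xml.etree.ElementTree.  This is only an illustration of
the feed structure, not ManifoldCF's RSS connector code: the sample feed
text, the example.com URLs, and the function name split_descriptions are
all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed mirroring the structure quoted in the thread.
SAMPLE_FEED = """<rss>
  <channel>
    <title>Example channel</title>
    <link>http://example.com/</link>
    <description>channel description</description>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <description>item description</description>
    </item>
  </channel>
</rss>"""


def split_descriptions(feed_xml):
    """Return (channel_description, {item_link: item_description})."""
    root = ET.fromstring(feed_xml)      # root is the <rss> element
    channel = root.find("channel")
    # A direct-child lookup sees only the channel's own <description>,
    # not the <description> elements nested inside each <item>.
    channel_description = channel.findtext("description", default="")
    items = {}
    for item in channel.findall("item"):
        link = item.findtext("link", default="")
        items[link] = item.findtext("description", default="")
    return channel_description, items
```

Calling split_descriptions(SAMPLE_FEED) returns the channel description
separately from a per-item mapping keyed by link, which is the separation
the ticket asks the connector to expose.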
