Bugs item #993099, was opened at 2004-07-18 01:19
Message generated for change (Comment added) made by joa23
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993099&group_id=59548

Category: plugin: parse-html
Group: None
>Status: Closed
>Resolution: Rejected
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: The content of alt, longdesc, title attribute aren't stored 

Initial Comment:
'alt', 'title' and 'longdesc' can contain valuable
information, but their values are not currently stored
by Nutch. 
Those attributes are used to describe the content of
image files, audio files, video files, and other
non-textual data for those who can't view, listen to,
or watch them. That includes not just blind and deaf
users and text browser users but also crawlers like Nutch.
Note that 'alt' is mandatory for the 'img' tag in HTML 4.x
and XHTML 1.x because it's an important accessibility
feature (see http://www.w3.org/WAI). 

When web accessibility is promoted, it's almost
always mentioned that accessibility is not only for
people with disabilities but also for machine agents
(crawlers, search engines, etc.). That is, 'alt', 'longdesc',
and 'title' are there for Nutch to take advantage of, and
Nutch should do so. 

The other day, while browsing web pages Nutch
collected (both raw HTML files and text summary files),
I hit upon a page with two dozen JPEG photos with
detailed descriptions for them in longdesc and title.
Nutch doesn't have any of those descriptions in the text summary.


Attached is my patch, which uses regex. I'm not sure
whether regex is too slow for this; if it is, I'll
make it use a 'Set' instead. 


----------------------------------------------------------------------

>Comment By: Stefan Groschupf (joa23)
Date: 2005-03-10 20:17

Message:
Logged In: YES 
user_id=396197

See the last comment by Doug. Please update your patch against the latest 
source code in Apache's Subversion repository, or write your own index plugin.
Please submit a new patch to the new issue tracker:
http://issues.apache.org/jira/browse/Nutch

----------------------------------------------------------------------

Comment By: Jungshik Shin (jshin)
Date: 2004-07-20 04:53

Message:
Logged In: YES 
user_id=307557

Thanks for taking a look. I agree with you on all three
points. I'll make the changes you suggested later this week and
upload a new patch. 

----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2004-07-19 18:49

Message:
Logged In: YES 
user_id=21778

I agree that we should fix this, but I'm not sure this patch
is quite ready.

First, it is against an old version of the code, not the
latest CVS, I think.

Second, I think that, instead of regex, a Set of names to
check would be much faster.  And shouldn't this check be
case-insensitive?  If so, then the set should use
String.CASE_INSENSITIVE_ORDER.
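
A minimal sketch of the case-insensitive Set idea, for illustration only. The
class name IndexedAttributeNames and the method isIndexed are hypothetical and
are not taken from the patch or from the Nutch source; the code is written in
current Java syntax rather than what was available in 2004.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch: names are illustrative only.
class IndexedAttributeNames {
    // A TreeSet ordered by String.CASE_INSENSITIVE_ORDER treats
    // "ALT", "Alt" and "alt" as the same element.
    private static final Set<String> NAMES =
        new TreeSet<>(String.CASE_INSENSITIVE_ORDER);

    static {
        NAMES.addAll(Arrays.asList("alt", "title", "longdesc"));
    }

    // Returns true if the attribute name is one whose value should be kept.
    static boolean isIndexed(String attributeName) {
        // The comparator does not accept null, so guard against it here.
        return attributeName != null && NAMES.contains(attributeName);
    }

    public static void main(String[] args) {
        System.out.println(isIndexed("ALT"));
        System.out.println(isIndexed("href"));
    }
}
```

The lookup is a tree search over three short strings, so no pattern has to be
compiled or matched for each attribute.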

Third, should we only index these attributes where they're
legal?  According to
http://www.w3.org/TR/html4/index/attributes.html, LONGDESC
is only allowed in IMG, ALT is only allowed in IMG, AREA,
APPLET, and INPUT, and TITLE is allowed almost everywhere. 
If we allow them everywhere we might be tempting spammers...
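
A rough sketch of restricting each attribute to the elements where it is legal.
The element lists are copied from the comment above and should be checked
against the HTML 4 attribute index before use. The class name
LegalAttributeFilter and the method shouldIndex are hypothetical, and treating
TITLE as acceptable on every element is a simplifying assumption.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical filter, not part of Nutch: names are illustrative only.
class LegalAttributeFilter {
    // Attribute name -> elements on which it is indexed (case-insensitive keys).
    private static final Map<String, Set<String>> ALLOWED_ELEMENTS =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    static {
        // Lists taken from the comment above, not verified against the spec here.
        ALLOWED_ELEMENTS.put("longdesc", elements("img"));
        ALLOWED_ELEMENTS.put("alt", elements("img", "area", "applet", "input"));
    }

    // Builds a case-insensitive set of element names.
    private static Set<String> elements(String... names) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(names));
        return set;
    }

    // Returns true if the attribute's value should be indexed on this element.
    static boolean shouldIndex(String element, String attribute) {
        if (element == null || attribute == null) {
            return false;
        }
        // Assumption: TITLE is accepted on any element ("almost everywhere").
        if ("title".equalsIgnoreCase(attribute)) {
            return true;
        }
        Set<String> allowed = ALLOWED_ELEMENTS.get(attribute);
        return allowed != null && allowed.contains(element);
    }
}
```

With this shape, a call such as shouldIndex("div", "alt") is rejected, which
addresses the spam concern for attributes placed on elements where they carry
no meaning.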

----------------------------------------------------------------------

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers