[ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364122 ]
James Jonas commented on NUTCH-59:
----------------------------------
I would like to offer my vote for NUTCH-59 (+1).
I do have some comments regarding the metadata infrastructure in Nutch.
Here are some of my thoughts.
Storing metadata in WebDB opens up a long list of potential new uses for
Nutch:
- Location-based queries
  - The topic of the page relates to a given city, state, country, or
    geospatial coordinate
  - This page has multiple locations (list of WalMart stores)
  - The server is located in this country (legal domicile)
  - The content is targeted to this geographic group (Middle East, East
    Chicago)
  - This particular location has this list of websites associated with it
    (garage.com has invested in X companies located in this area)
  - Directions to the store on the website (mapquest)
  - List of other websites/stores in the area (google local)
...
- People and Organizations
  - whois info
  - webmaster
  - editor(s)
  - company that owns the website
  - group within the company that owns the website
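
To make the two groups above a little more concrete, here is a minimal sketch
of how such attributes might hang off a page record. It is purely
illustrative: the `PageMetadata` class, the `geo.` and `party.` key prefixes,
and the sample URL and values are all hypothetical and are not taken from the
NUTCH-59 patch or from the existing WebDB.

```python
from collections import defaultdict


class PageMetadata:
    """Hypothetical per-page metadata with multi-valued, namespaced keys."""

    def __init__(self, url):
        self.url = url
        self._values = defaultdict(list)

    def add(self, key, value):
        # A page may carry several values per key (e.g. many store locations).
        self._values[key].append(value)

    def get(self, key):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._values.get(key, []))

    def keys(self, namespace):
        # All keys belonging to one metadata class, e.g. "geo" or "party".
        prefix = namespace + "."
        return sorted(k for k in self._values if k.startswith(prefix))


# Sample values are made up for illustration only.
page = PageMetadata("http://www.example.com/stores")
page.add("geo.location", "Chicago, IL, US")
page.add("geo.location", "East Chicago, IN, US")
page.add("geo.server_country", "US")
page.add("party.webmaster", "webmaster@example.com")
```

Multi-valued keys are used because several of the cases above (a page listing
many stores, a site with several editors) need more than one value per
attribute.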
There are several other metadata classes that can be associated with a page.
- Dublin Core (as mentioned in other Nutch requirement docs)
- CWM - Common Warehouse Metamodel - links data warehouse (data mart)
information to a web page.
- Products (Froogle, Business.com...)
There are also newer forms of popular website technologies, each of which
carries its own set of metadata:
- wiki (license, topic...)
- blog (topic, person, group...)
- personal profiles (dating, facebook.com)
- ontologies (dmoz, jena - owl, wordnet)
- ...
Unstructured data (the web) contains a long list of coarse-grained classes of
metadata that can be associated with each Page (artifact).
A CONCEPTUAL META-MODEL FOR UNSTRUCTURED DATA
http://www.tdan.com/i024fe01.htm
The models that persist metadata can become very complex.
A UNIVERSAL PERSON AND ORGANIZATION DATA MODEL:
THE PARTY/PARTY-RELATIONSHIP PATTERN
http://www.tdan.com/i021ht04.htm
The same is true of the repositories that persist this type of data:
Advanced Meta Data Architecture
http://www.tdan.com/i013fe01.htm
Summary:
- large number of types of metadata
- metadata models can be complex
- a number of different architectures for storing metadata
- persisting metadata can be costly (query time, updates...)
Some Options
(1) WebDB Metadata Storage (changes to index, query filter, ...)
- NUTCH-59
- NUTCH-139
...
with tools and plugins
- Ontologies
- Geospatial
...
(2) Internal Metadata Store - Create a MetaDB store that provides local storage
of denormalized metadata in Lucene. This could use an optimized subset of a
Metadata API.
(3) Metadata API - Formal API from Nutch into other external Metadata
Repositories (Lucene, MySQL, DB2, Jena (OWL), GIS ...)
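
As a rough sketch of how options (2) and (3) could share one interface, the
snippet below defines a small Metadata API with two interchangeable backends.
Everything here is an assumption made for discussion: `MetadataStore`,
`InMemoryMetaDB` and `ExternalStoreAdapter` are hypothetical names, a plain
dict stands in for a Lucene-backed MetaDB, and the "external" backend is just
another in-process store with a call counter. It is written in Python only to
keep it short.

```python
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Hypothetical Metadata API: the only surface Nutch would call."""

    @abstractmethod
    def put(self, url, key, value):
        """Attach one more value to (url, key)."""

    @abstractmethod
    def get(self, url, key):
        """Return all values stored for (url, key), oldest first."""


class InMemoryMetaDB(MetadataStore):
    """Stand-in for option (2): a local, denormalized store.

    A real MetaDB might sit on Lucene; a dict keeps this sketch runnable.
    """

    def __init__(self):
        self._rows = {}

    def put(self, url, key, value):
        self._rows.setdefault((url, key), []).append(value)

    def get(self, url, key):
        return list(self._rows.get((url, key), []))


class ExternalStoreAdapter(MetadataStore):
    """Stand-in for option (3): forwards every call to another repository.

    The counter makes visible that each lookup is a call out to that store.
    """

    def __init__(self, backend):
        self._backend = backend
        self.calls = 0

    def put(self, url, key, value):
        self.calls += 1
        self._backend.put(url, key, value)

    def get(self, url, key):
        self.calls += 1
        return self._backend.get(url, key)


def owners_of(store, url):
    # Caller code depends only on the API, not on where the data lives.
    return store.get(url, "party.owner")


local = InMemoryMetaDB()
local.put("http://www.example.com/", "party.owner", "Example Corp")

remote = ExternalStoreAdapter(InMemoryMetaDB())
remote.put("http://www.example.com/", "party.owner", "Example Corp")
```

The point of the shape is that `owners_of` works unchanged against either
backend, so the choice between a local MetaDB and an external repository could
be deferred or made per metadata class.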
Issues to consider:
- persisting metadata in the WebDB/index offers faster queries
- as metadata becomes larger and more complex and the number of pages increases
(50 million to 6 billion), updates and searches will suffer
- use of external stores will impact any process that requires a call to that
store
- external metadata stores can persist more complex forms of metadata
- Lucene, which is optimized for unstructured data, may not be the best
persistence mechanism for complex metadata
Feedback:
Please tell me if I'm close in articulating some of the issues that may need
to be considered in defining a metadata architecture for Nutch.
Suggesting solutions (Metadata API and MetaDB) at this stage is only meant to
enhance discussion. A few more iterations on the requirements for a broader
metadata architecture are necessary before we start laying down concrete
solutions.
Thanks,
James
> meta data support in webdb
> --------------------------
>
> Key: NUTCH-59
> URL: http://issues.apache.org/jira/browse/NUTCH-59
> Project: Nutch
> Type: New Feature
> Reporter: Stefan Groschupf
> Priority: Minor
> Attachments: webDBMetaDataPatch.txt
>
> Meta data support in web db would be very useful for a new set of nutch
> features that need long-lived meta data.
> Currently, page meta data needs to be regenerated or looked up each time a
> page is re-fetched (every 30 days); in the long run, web db meta data would
> bring a dramatic performance improvement for such tasks.
> Furthermore, storage of meta data in webdb would make a new generation of
> linklist generation filters possible.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers