I can immediately think of two approaches I'd consider:
1. Each Fedora object represents a website. There is a content DS that is a
WARC; it is versioned. There is a screenshot DS and a full-text DS; these
are unversioned and presumed to relate to the most recent WARC. You could
have a similar DS for descriptive metadata. This model hinges on being able
to assume the non-WARC datastreams only really have a relationship to the
object, not a particular crawl.

2. Every crawl is an object, meant never to be altered. The related DSs
might change between versions; you would have to directly compare the
analogous DSs in two objects. You would refer to the website across all
versions either in a metadatum (a RELS-EXT relationship, or something
consistent in the descMetadata like a MODS relatedItem) or as an umbrella
object across all the versions (obviously still marking up the
isVersionOf/isMemberOf relationships, but having a place to locate the
generic descriptions separate from the crawl descriptions).

The latter is probably more "correct", but it will also be more cumbersome
to work with.  If you wanted to actually extract resources from a WARC (a
particular file asset, for example), I think you really have to follow a
plan like the second option.
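
To make the difference concrete, here is a rough Python sketch of the two
models. It is only an illustration: the class names, datastream IDs, PIDs and
the bare "isVersionOf" predicate string are all made up for the example, and
none of it is Fedora's or Islandora's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Datastream:
    """Hypothetical datastream holding one or more versions (newest last)."""
    dsid: str
    versioned: bool
    versions: list = field(default_factory=list)

    def add(self, content):
        # A versioned DS keeps its history; an unversioned DS is overwritten.
        if self.versioned:
            self.versions.append(content)
        else:
            self.versions = [content]

    def latest(self):
        return self.versions[-1]


@dataclass
class RepoObject:
    """Hypothetical repository object: a PID, datastreams, relationships."""
    pid: str
    datastreams: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)  # (predicate, target PID)


def record_crawl_option1(site, warc, screenshot):
    """Option 1: one object per website; each crawl is a new WARC version."""
    site.datastreams.setdefault("WARC", Datastream("WARC", True)).add(warc)
    # The screenshot is unversioned, so it only ever reflects the latest crawl.
    site.datastreams.setdefault(
        "SCREENSHOT", Datastream("SCREENSHOT", False)
    ).add(screenshot)


def record_crawl_option2(umbrella, crawl_pid, warc, screenshot):
    """Option 2: one object per crawl, linked back to an umbrella object."""
    crawl = RepoObject(crawl_pid)
    for dsid, content in (("WARC", warc), ("SCREENSHOT", screenshot)):
        ds = Datastream(dsid, False)
        ds.add(content)
        crawl.datastreams[dsid] = ds
    crawl.relationships.append(("isVersionOf", umbrella.pid))
    return crawl
```

Calling record_crawl_option1 twice on the same object leaves two WARC
versions but only the newest screenshot, which is the assumption option 1
hinges on. Calling record_crawl_option2 twice yields two separate objects
that each keep their own screenshot and point at the umbrella object.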

- Ben


On Mon, Mar 4, 2013 at 10:56 PM, Nick Ruest <rue...@gmail.com> wrote:

> Hi folks,
>
> I began working on an Islandora Solution Pack for web archives a while
> back, and the more I work on it and think about it, the more I'm stuck on
> a foundational aspect: what is the object?
>
> The way I had initially constructed it as a proof of concept was just
> ingesting and disseminating warc files. But, as I learn more and more
> about web archiving, there is more I'd like to do dissemination-wise
> with associated datastreams (screenshots, pdfs) and full-text searching
> of warcs.
>
> So, here is my issue. Is an object a given crawl of a site? For example,
> a web crawl of http://yfile.news.yorku.ca on March 4, 2013? Or is an
> object a given website, the yfile example, and each crawl is a version
> of a datastream?
>
> To me it all seems like a matter of how a given collection is arranged
> and described, and both solutions are technically correct. But, is one
> way better than the other?
>
> If you'll indulge me, I'd love to hear your input.
>
> cheers!
>
> --
> -nruest
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_feb
> _______________________________________________
> Fedora-commons-users mailing list
> Fedora-commons-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>