This is a good discussion - I like the direction you are heading in, Nick, and 
Ben's observations are spot on. The flexibility of the second approach would be 
considerable, and I'm not sure the effort needed to present various access 
options would be that much greater than with the first option. I think it also 
makes it "cleaner" to have a metadata record for each crawl, which would 
provide significant advantages.

We are holding Islandora Camp in Europe in two weeks, so I will make an effort 
to have this discussion in one of the unConference sessions; I suspect some of 
the people attending will have some WARC experience. Depending on how far you 
get in the next couple of months, this might also be a good suggestion for the 
Dev Challenge for OR2013 in July. We haven't sent out the Challenge details 
yet, but there will be a hackfest component this year and we will be soliciting 
ideas in the near future.

Mark

On 2013-03-05, at 12:57 AM, Nick Ruest <rue...@gmail.com> wrote:

> The biggest concern I have about number one is how unruly it could get 
> very quickly, given that some sites have to be archived 5 days a week, 
> Monday to Friday. That would be 261 versions of an object per year.
> 
> What I had worked out in my head was a combination of 1 & 2, where I 
> create a collection object (Islandora parlance) and then an object for 
> every crawl, which in turn would be a child of the collection object.
> 
> I think if I go that direction, I would be able to quickly display the 
> number of crawls for a given site, then provide some "serendipitous" 
> browse options for a user - paging through screenshots, date lists, 
> full-text searching, etc. I just really have to figure out how to get 
> fcrepo and the wayback machine to talk to each other :-)
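> 
> In case it helps make that concrete, here is a very rough sketch of what I'm 
> picturing against the Fedora 3 REST API (the PIDs, wayback URL, and 
> credentials below are all made up):
> 
>     # Sketch only: create a per-crawl child object under a collection object,
>     # then point a Redirect ("R") datastream at that crawl's wayback URL.
>     import requests
> 
>     FEDORA = "http://localhost:8080/fedora"
>     AUTH = ("fedoraAdmin", "fedoraAdmin")
>     REL_NS = "info:fedora/fedora-system:def/relations-external#"
> 
>     collection_pid = "webarchive:yfile"           # umbrella collection object
>     crawl_pid = "webarchive:yfile-2013-03-04"     # one object per crawl
>     wayback_url = ("http://wayback.example.org/20130304000000/"
>                    "http://yfile.news.yorku.ca/")
> 
>     # 1. Ingest the crawl object.
>     requests.post("%s/objects/%s" % (FEDORA, crawl_pid),
>                   params={"label": "yfile crawl 2013-03-04"}, auth=AUTH)
> 
>     # 2. Make it a member of the collection object.
>     requests.post("%s/objects/%s/relationships/new" % (FEDORA, crawl_pid),
>                   params={"predicate": REL_NS + "isMemberOfCollection",
>                           "object": "info:fedora/" + collection_pid,
>                           "isLiteral": "false"}, auth=AUTH)
> 
>     # 3. Add a Redirect datastream so playback is handed off to wayback.
>     requests.post("%s/objects/%s/datastreams/WAYBACK" % (FEDORA, crawl_pid),
>                   params={"controlGroup": "R", "dsLocation": wayback_url,
>                           "dsLabel": "Wayback playback",
>                           "mimeType": "text/html"}, auth=AUTH)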
> 
> Also, FWIW, we had a fruitful discussion[1] on Twitter earlier, if anybody 
> is interested in the thread. It seemed to jump off list and back on.
> 
> Thanks for jumping in, Ben!
> 
> -nruest
> 
> [1] https://twitter.com/mjgiarlo/status/308788759246303233
> 
> On 13-03-04 11:36 PM, Benjamin Armintor wrote:
>> I can immediately think of two approaches I'd consider:
>> 1. Each Fedora object represents a website. There is a content DS that
>> is a WARC; it is versioned. There is a screenshot DS and a full-text DS;
>> these are unversioned and presumed to relate to the most recent WARC.
>> You could have a similar DS for descriptive metadata. This model hinges
>> on being able to assume the non-WARC datastreams only really have a
>> relationship to the object, not a particular crawl.
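>> 
>> (Just to sketch what that looks like in practice - PIDs, paths, and
>> credentials made up, and assuming the Fedora 3 REST API - each crawl simply
>> overwrites the one versioned WARC datastream, and Fedora keeps the history:)
>> 
>>     # Sketch of option 1: one object per website, each crawl pushed as a
>>     # new version of a single versioned WARC datastream.
>>     import requests
>> 
>>     FEDORA = "http://localhost:8080/fedora"
>>     AUTH = ("fedoraAdmin", "fedoraAdmin")
>>     pid = "webarchive:yfile"
>> 
>>     def push_crawl(warc_path):
>>         ds_url = "%s/objects/%s/datastreams/WARC" % (FEDORA, pid)
>>         params = {"controlGroup": "M", "versionable": "true",
>>                   "mimeType": "application/warc", "dsLabel": "WARC"}
>>         exists = requests.get(ds_url, auth=AUTH).status_code == 200
>>         with open(warc_path, "rb") as f:
>>             # POST creates the datastream on the first crawl; PUT on later
>>             # crawls adds a new version while earlier ones are retained.
>>             if exists:
>>                 r = requests.put(ds_url, params=params, data=f, auth=AUTH)
>>             else:
>>                 r = requests.post(ds_url, params=params, data=f, auth=AUTH)
>>         r.raise_for_status()
>> 
>>     push_crawl("/data/crawls/yfile-2013-03-04.warc.gz")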
>> 
>> 2. Every crawl is an object, meant never to be altered. The related DSs
>> might change between versions; you would have to directly compare the
>> analogous DSs in two objects. You would refer to the website across all
>> versions either in a metadatum (a RELS-EXT relationship, or something
>> consistent in the descMetadata like a MODS relatedItem) or as an
>> umbrella object across all the versions (obviously still marking up the
>> isVersionOf/isMemberOf relationships, but having a place to locate the
>> generic descriptions separate from the crawl descriptions).
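>> 
>> (For the RELS-EXT flavor, each crawl object would just carry something like
>> this - PIDs invented for the example:)
>> 
>>     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>              xmlns:rel="info:fedora/fedora-system:def/relations-external#">
>>       <rdf:Description rdf:about="info:fedora/webarchive:yfile-2013-03-04">
>>         <rel:isMemberOf rdf:resource="info:fedora/webarchive:yfile"/>
>>       </rdf:Description>
>>     </rdf:RDF>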
>> 
>> The latter is probably more "correct", but it will also be more
>> cumbersome to work with.  If you wanted to actually extract resources
>> from a WARC (a particular file asset, for example), I think you really
>> have to follow a plan like the second option.
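>> 
>> (A WARC-reading library handles the mechanics of pulling records out; e.g.,
>> with Python's warcio it's roughly the following - file path is a
>> placeholder:)
>> 
>>     # Sketch: iterate a WARC and pull out the individual response records.
>>     from warcio.archiveiterator import ArchiveIterator
>> 
>>     with open("/data/crawls/yfile-2013-03-04.warc.gz", "rb") as stream:
>>         for record in ArchiveIterator(stream):
>>             if record.rec_type == "response":
>>                 uri = record.rec_headers.get_header("WARC-Target-URI")
>>                 body = record.content_stream().read()
>>                 print(uri, len(body))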
>> 
>> - Ben
>> 
>> 
>> On Mon, Mar 4, 2013 at 10:56 PM, Nick Ruest <rue...@gmail.com
>> <mailto:rue...@gmail.com>> wrote:
>> 
>>    Hi folks,
>> 
>>    I began working on an Islandora Solution Pack for web archives a while
>>    back, and the more I work on it and think about it, the more I'm stuck on
>>    a foundational aspect: what is the object?
>> 
>>    The way I had initially constructed it as a proof of concept was just
>>    ingesting and disseminating WARC files. But, as I learn more and more
>>    about web archiving, there is more I'd like to do dissemination-wise
>>    with associated datastreams (screenshots, PDFs) and full-text searching
>>    of WARCs.
>> 
>>    So, here is my issue. Is an object a given crawl of a site - for example,
>>    a web crawl of http://yfile.news.yorku.ca on March 4, 2013? Or is an
>>    object a given website, the yfile example, with each crawl being a
>>    version of a datastream?
>> 
>>    To me it all seems like a matter of how a given collection is arranged
>>    and described, and both solutions are technically correct. But is one
>>    way better than the other?
>> 
>>    If you'll indulge me, I'd love to hear your input.
>> 
>>    cheers!
>> 
>>    --
>>    -nruest
>> 
> 
> -- 
> -nruest