Re: [fcrepo-user] Web archive CModels

Nick Ruest Thu, 07 Mar 2013 11:24:21 -0800

Hi Mark,

On 13-03-05 12:05 PM, Mark Leggott wrote:
> This is a good discussion - I like the direction you are heading, Nick, and 
> Ben's observations are spot on. The flexibility with the 2nd approach would 
> be considerable and I'm not sure the effort needed to present various access 
> options would be that much greater than with the 1st option. I think it also 
> makes it "cleaner" to have a metadata record for each crawl, which would 
> provide significant advantages.
>
> We are holding Islandora Camp in Europe in 2 weeks, so I will make an effort 
> to have this discussion in one of the unConference sessions: I suspect some 
> of the people attending will have some WARC experience. Depending on how far 
> you get in the next couple of months this might also be a good suggestion for 
> the Dev Challenge for OR2013 in July. We haven't sent out the Challenge 
> details but there will be a hackfest component this year and we will be 
> soliciting ideas in the near future.
>


This sounds great! I started working on incorporating the additional 
datastreams today, and I'd love to collaborate with folks on this.

Thanks for the support!

-nruest
> Mark
>
> On 2013-03-05, at 12:57 AM, Nick Ruest <rue...@gmail.com> wrote:
>
>> The biggest concern I have about number one is how unruly it could get
>> very quickly, given that some sites have to be archived five days a week,
>> Monday through Friday. That would be 261 versions of an object/year.
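
For anyone checking the 261 figure, here is a minimal sketch that counts the Monday to Friday dates in a calendar year. The function name weekday_count is made up for illustration, and it assumes one crawl per weekday with no holidays skipped.

```python
from datetime import date, timedelta


def weekday_count(year: int) -> int:
    """Count the Monday-Friday dates in the given calendar year."""
    day = date(year, 1, 1)
    count = 0
    while day.year == year:
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            count += 1
        day += timedelta(days=1)
    return count


print(weekday_count(2013))  # 261
```

The total varies between 260 and 262 depending on the year, so the number of versions per object would shift slightly from year to year.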
>>
>> What I had worked out in my head was a combination of 1 & 2, where I
>> create a collection object (Islandora parlance) and then an object for
>> every crawl, which in turn would be a child of the collection object.
>>
>> I think if I go that direction, I would be able to quickly display the
>> number of crawls for a given site, then provide some "serendipitous"
>> browse options for a user - paging through screenshots, date lists,
>> full-text searching, etc. I just really have to figure out how to get
>> fcrepo and the wayback machine to talk to each other :-)
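
To make the combined layout concrete, here is a rough sketch of a collection object with one child object per crawl, plus the crawl count and date list mentioned above. It is plain Python, not Fedora or Islandora code; the class names, PIDs, and datastream IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Crawl:
    """One crawl of a site, modelled as its own object (hypothetical)."""
    pid: str
    crawled_on: date
    # Hypothetical datastream IDs attached to each crawl object.
    datastreams: tuple = ("WARC", "SCREENSHOT", "FULL_TEXT")


@dataclass
class SiteCollection:
    """Collection object for one website; crawls are its children."""
    pid: str
    url: str
    crawls: list = field(default_factory=list)

    def add_crawl(self, crawl: Crawl) -> None:
        self.crawls.append(crawl)

    def crawl_count(self) -> int:
        return len(self.crawls)

    def date_list(self) -> list:
        # Sorted crawl dates, suitable for a simple browse-by-date view.
        return sorted(c.crawled_on for c in self.crawls)


site = SiteCollection(pid="yfile:collection", url="http://yfile.news.yorku.ca")
site.add_crawl(Crawl(pid="yfile:2", crawled_on=date(2013, 3, 5)))
site.add_crawl(Crawl(pid="yfile:1", crawled_on=date(2013, 3, 4)))

print(site.crawl_count())                          # 2
print([d.isoformat() for d in site.date_list()])   # ['2013-03-04', '2013-03-05']
```

Because each crawl carries its own datastreams, a screenshot or full-text view always refers to one specific crawl rather than to "the most recent" one.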
>>
>> Also, FWIW, we had a fruitful discussion[1] on Twitter earlier, if anybody
>> is interested in the thread. It seemed to jump off list and back on.
>>
>> Thanks for jumping in, Ben!
>>
>> -nruest
>>
>> [1] https://twitter.com/mjgiarlo/status/308788759246303233
>>
>> On 13-03-04 11:36 PM, Benjamin Armintor wrote:
>>> I can immediately think of two approaches I'd consider:
>>> 1. Each Fedora object represents a website. There is a content DS that
>>> is a WARC; it is versioned. There is a screenshot DS and a full-text DS-
>>> these are unversioned and presumed to relate to the most recent WARC.
>>> You could have a similar DS for descriptive metadata. This model hinges
>>> on being able to assume the non-WARC datastreams only really have a
>>> relationship to the object, not a particular crawl.
>>>
>>> 2. Every crawl is an object, meant never to be altered. The related DSs
>>> might change between versions- you would have to directly compare the
>>> analogous DSs in two objects. You would refer to the website across all
>>> versions either in a metadatum (a RELS-EXT relationship, or something
>>> consistent in the descMetadata like a MODS relatedItem) or as an
>>> umbrella object across all the versions (obviously still marking up the
>>> isVersionOf/isMemberOf relationships, but having a place to locate the
>>> generic descriptions separate from the crawl descriptions).
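
As an illustration of the second approach, here is a minimal sketch that builds a RELS-EXT document linking one crawl object to an umbrella object for the website. It assumes the isMemberOf relationship from the Fedora relations-external namespace is the one chosen; the function name and both PIDs are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FEDORA_NS = "info:fedora/fedora-system:def/relations-external#"


def rels_ext(crawl_pid: str, site_pid: str) -> str:
    """Return RDF/XML stating that the crawl object is a member of the site object."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("fedora", FEDORA_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject is the crawl object itself.
    desc = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{crawl_pid}"},
    )
    # The relationship points at the umbrella (website) object.
    ET.SubElement(
        desc,
        f"{{{FEDORA_NS}}}isMemberOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{site_pid}"},
    )
    return ET.tostring(root, encoding="unicode")


# Hypothetical PIDs for a single crawl and its website object.
print(rels_ext("yfile:crawl-2013-03-04", "yfile:site"))
```

The generic description of the website would then live on the umbrella object, while each crawl object keeps its own crawl-specific metadata.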
>>>
>>> The latter is probably more "correct", but it will also be more
>>> cumbersome to work with.  If you wanted to actually extract resources
>>> from a WARC (a particular file asset, for example), I think you really
>>> have to follow a plan like the second option.
>>>
>>> - Ben
>>>
>>>
>>> On Mon, Mar 4, 2013 at 10:56 PM, Nick Ruest <rue...@gmail.com
>>> <mailto:rue...@gmail.com>> wrote:
>>>
>>>     Hi folks,
>>>
>>>     I began working on an Islandora Solution Pack for web archives a while
>>>     back, and the more I work on it and think about it, I'm a little stuck on
>>>     a foundational aspect: what is the object?
>>>
>>>     The way I had initially constructed it as a proof of concept was just
>>>     ingesting and disseminating warc files. But, as I learn more and more
>>>     about web archiving, there is more I'd like to do dissemination-wise
>>>     with associated datastreams (screenshots, pdfs) and full-text searching
>>>     of warcs.
>>>
>>>     So, here is my issue. Is an object a given crawl of a site? For example,
>>>     a web crawl of http://yfile.news.yorku.ca on March 4, 2013? Or is an
>>>     object a given website, the yfile example, and each crawl is a version
>>>     of a datastream?
>>>
>>>     To me it all seems like a matter of how a given collection is arranged
>>>     and described, and both solutions are technically correct. But, is one
>>>     way better than the other?
>>>
>>>     If you'll indulge me, I'd love to hear your input.
>>>
>>>     cheers!
>>>
>>>     --
>>>     -nruest
>>>
>>>     
>>>     _______________________________________________
>>>     Fedora-commons-users mailing list
>>>     Fedora-commons-users@lists.sourceforge.net
>>>     <mailto:Fedora-commons-users@lists.sourceforge.net>
>>>     https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> -nruest
>>
>
>
>

