Hi Mark,

On 13-03-05 12:05 PM, Mark Leggott wrote:
> This is a good discussion - I like the direction you are heading in, Nick, and
> Ben's observations are spot on. The flexibility with the 2nd approach would
> be considerable and I'm not sure the effort needed to present various access
> options would be that much greater than with the 1st option. I think it also
> makes it "cleaner" to have a metadata record for each crawl, which would
> provide significant advantages.
>
> We are holding Islandora Camp in Europe in 2 weeks, so I will make an effort
> to have this discussion in one of the unConference sessions: I suspect some
> of the people attending will have some WARC experience. Depending on how far
> you get in the next couple of months this might also be a good suggestion for
> the Dev Challenge for OR2013 in July. We haven't sent out the Challenge
> details but there will be a hackfest component this year and we will be
> soliciting ideas in the near future.
>
This sounds great! I started working on incorporating the additional
datastreams today, and I'd love to collaborate with folks on this. Thanks
for the support!

-nruest

> Mark
>
> On 2013-03-05, at 12:57 AM, Nick Ruest <rue...@gmail.com> wrote:
>
>> The biggest concern I have about number one is how unruly it could get
>> very quickly, given that some sites have to be archived 5 days a week,
>> Monday - Friday. That would be 261 versions of an object/year.
>>
>> What I had worked out in my head was a combination of 1 & 2, where I
>> create a collection object (Islandora parlance) and then an object for
>> every crawl, which in turn would be a child of the collection object.
>>
>> I think if I go in that direction, I would be able to quickly display the
>> number of crawls for a given site, then provide some "serendipitous"
>> browse options for a user - paging through screenshots, date lists,
>> full-text searching, etc. I just really have to figure out how to get
>> fcrepo and the wayback machine to talk to each other :-)
>>
>> Also, FWIW we had a fruitful discussion[1] on Twitter earlier, if anybody
>> is interested in the thread. It seemed to jump off list and back on.
>>
>> Thanks for jumping in, Ben!
>>
>> -nruest
>>
>> [1] https://twitter.com/mjgiarlo/status/308788759246303233
>>
>> On 13-03-04 11:36 PM, Benjamin Armintor wrote:
>>> I can immediately think of two approaches I'd consider:
>>> 1. Each Fedora object represents a website. There is a content DS that
>>> is a WARC; it is versioned. There is a screenshot DS and a full-text DS -
>>> these are unversioned and presumed to relate to the most recent WARC.
>>> You could have a similar DS for descriptive metadata. This model hinges
>>> on being able to assume the non-WARC datastreams only really have a
>>> relationship to the object, not a particular crawl.
>>>
>>> 2. Every crawl is an object, meant never to be altered.
>>> The related DSs might change between versions - you would have to
>>> directly compare the analogous DSs in two objects. You would refer to the
>>> website across all versions either in a metadatum (a RELS-EXT
>>> relationship, or something consistent in the descMetadata like a MODS
>>> relatedItem) or as an umbrella object across all the versions (obviously
>>> still marking up the isVersionOf/isMemberOf relationships, but having a
>>> place to locate the generic descriptions separate from the crawl
>>> descriptions).
>>>
>>> The latter is probably more "correct", but it will also be more
>>> cumbersome to work with. If you wanted to actually extract resources
>>> from a WARC (a particular file asset, for example), I think you really
>>> have to follow a plan like the second option.
>>>
>>> - Ben
>>>
>>>
>>> On Mon, Mar 4, 2013 at 10:56 PM, Nick Ruest <rue...@gmail.com> wrote:
>>>
>>> Hi folks,
>>>
>>> I began working on an Islandora Solution Pack for web archives a while
>>> back, and the more I work on it and think about it, the more I'm stuck on
>>> a foundational aspect: what is the object?
>>>
>>> The way I had initially constructed it as a proof of concept was just
>>> ingesting and disseminating WARC files. But as I learn more and more
>>> about web archiving, there is more I'd like to do dissemination-wise
>>> with associated datastreams (screenshots, PDFs) and full-text searching
>>> of WARCs.
>>>
>>> So, here is my issue. Is an object a given crawl of a site? For example,
>>> a web crawl of http://yfile.news.yorku.ca on March 4, 2013? Or is an
>>> object a given website, the yfile example, and each crawl is a version
>>> of a datastream?
>>>
>>> To me it all seems like a matter of how a given collection is arranged
>>> and described, and both solutions are technically correct. But is one
>>> way better than the other?
>>>
>>> If you'll indulge me, I'd love to hear your input.
>>>
>>> cheers!
>>>
>>> --
>>> -nruest
>>>
>>> ------------------------------------------------------------------------------
>>> Everyone hates slow websites. So do we.
>>> Make your web apps faster with AppDynamics
>>> Download AppDynamics Lite for free today:
>>> http://p.sf.net/sfu/appdyn_d2d_feb
>>> _______________________________________________
>>> Fedora-commons-users mailing list
>>> Fedora-commons-users@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
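
For anyone following the thread, the hybrid arrangement discussed above (one collection object per website, with one never-altered child object per crawl) can be sketched roughly as follows. This is only an illustration in plain Python, not Fedora or Islandora code: the class names SiteCollection and Crawl, the datastream IDs, the file names, and the helper weekdays_in_year are all hypothetical and were made up for this sketch. It also checks the "261 versions per year" figure for a Monday - Friday crawl schedule in 2013.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass(frozen=True)
class Crawl:
    """One crawl of a site, modelled as an immutable object (hypothetical)."""
    crawled_on: date
    # Hypothetical (datastream ID, file name) pairs, e.g. WARC, screenshot.
    datastreams: tuple = ()


@dataclass
class SiteCollection:
    """Umbrella object for a website; every crawl is a child (hypothetical)."""
    url: str
    crawls: list = field(default_factory=list)

    def add_crawl(self, crawl: Crawl) -> None:
        self.crawls.append(crawl)

    def crawl_count(self) -> int:
        # "Quickly display the number of crawls for a given site."
        return len(self.crawls)

    def crawl_dates(self) -> list:
        # A sorted date list that a browse interface could page through.
        return sorted(c.crawled_on for c in self.crawls)


def weekdays_in_year(year: int) -> int:
    """Count the Monday-Friday days in a year (crawls/year at 5 days a week)."""
    day = date(year, 1, 1)
    count = 0
    while day.year == year:
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            count += 1
        day += timedelta(days=1)
    return count


site = SiteCollection(url="http://yfile.news.yorku.ca")
site.add_crawl(Crawl(date(2013, 3, 5), (("WARC", "crawl.warc"),)))
site.add_crawl(Crawl(date(2013, 3, 4),
                     (("WARC", "crawl.warc"), ("SCREENSHOT", "crawl.png"))))

print(site.crawl_count())       # 2
print(site.crawl_dates())
print(weekdays_in_year(2013))   # 261
```

Under these assumptions the per-crawl objects stay small and independent, while the count and the date list come from the parent, which is the trade-off the thread weighs against keeping 261 versions of a single datastream per year.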