The code used for overlap detection within the arXiv corpus (see [1], which significantly extended earlier work [2]) does matching based on a sliding window of hashed 7-word sequences on extracted ASCII text. Perhaps more than is required for the case in question, but this approach scales to a corpus of 1M articles. I'm afraid it is a little finicky to compile on current systems given all the changes in C++ library organization since I wrote it in 2005/2009. I'm working off-and-on to tidy it up but haven't got there yet... So, FWIW, code at:

https://github.com/zimeon/docsim
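
For anyone who just wants the flavour of the idea without compiling anything, here is a minimal Python sketch of hashed 7-word windows. It is not the docsim code and does not reproduce its behaviour. The function names, the tokenisation rule, the use of a truncated SHA-256, and the "fraction of the smaller document" score are all assumptions made for illustration.

```python
import hashlib
import re


def shingle_hashes(text, width=7):
    """Return the set of hashes of every `width`-word window in `text`.

    Tokenisation (lowercase runs of a-z and 0-9) is an assumption of this
    sketch, not something taken from docsim.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    hashes = set()
    # Slide a window of `width` words over the text, one word at a time.
    for i in range(len(words) - width + 1):
        window = " ".join(words[i:i + width])
        digest = hashlib.sha256(window.encode("ascii")).hexdigest()
        hashes.add(digest[:16])  # a truncated hash keeps the sets small
    return hashes


def overlap(text_a, text_b, width=7):
    """Fraction of the smaller document's windows that also occur in the other."""
    a = shingle_hashes(text_a, width)
    b = shingle_hashes(text_b, width)
    if not a or not b:
        return 0.0  # at least one text is shorter than one window
    return len(a & b) / min(len(a), len(b))
```

With about 100 assignments, you could compute the hash set once per document and then intersect the sets pairwise. Pairs with a high score are the ones to read by hand.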

Cheers,
Simeon

[1] http://arxiv.org/abs/1412.2716
[2] http://arxiv.org/abs/cs/0702012

On 1/23/15 9:44 AM, Mark A. Matienzo wrote:
I believe Turnitin and SafeAssign both compare the text of submissions
against external sources (e.g., SafeAssign uses ABI/INFORM, among others).
I am not certain if they compare submissions against each other.

However, if you're looking for something along the lines of what Dre
suggests, you could use ssdeep, which is an implementation of a piecewise
hashing algorithm [0]. The issue with that is that you would have to assume
that all students would probably be using the same file format.

You could also use something like Tika [1] to extract the text content from
all the submissions, and then compare them against each other.

[0] http://ssdeep.sourceforge.net/
[1] http://tika.apache.org/

Mark

--
Mark A. Matienzo <m...@matienzo.org>
Director of Technology, Digital Public Library of America

On Fri, Jan 23, 2015 at 8:47 AM, Andreas Orphanides <akorp...@ncsu.edu>
wrote:

My first thought was something like programmatically doing a pairwise diff
of the files, 5500 times. I was surprised I couldn't find a utility that
just does this.
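
A rough sketch of that pairwise pass, using only Python's standard library (difflib and itertools), might look like the following. It assumes the text has already been extracted from each file. The function name `rank_pairs`, the word-level comparison, and the 0.8 threshold are illustrative choices rather than an established tool. Note that 100 files give 4950 unordered pairs.

```python
import difflib
import itertools


def rank_pairs(texts, threshold=0.8):
    """Compare every pair of documents and return the most similar ones.

    `texts` maps a label (e.g. a file name) to its extracted plain text.
    Returns (ratio, name_a, name_b) tuples, most similar first.
    """
    flagged = []
    items = sorted(texts.items())
    # combinations() yields each unordered pair exactly once.
    for (name_a, text_a), (name_b, text_b) in itertools.combinations(items, 2):
        # Compare word sequences rather than characters; this is an
        # arbitrary choice that tends to be less noisy for prose.
        matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split())
        ratio = matcher.ratio()
        if ratio >= threshold:
            flagged.append((ratio, name_a, name_b))
    return sorted(flagged, reverse=True)
```

SequenceMatcher can be slow on long texts, so for large sets a hash-based comparison is likely a better fit.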

But I did find something called diffuse [1], which allows you to graphically
compare any number of text files in a diff-like fashion. This would
probably at least be able to tell you which files need closer scrutiny.

I think you'd presumably have to be able to extract the text from each
file; I doubt it would work on raw Word docs or PDFs, so that might be a
stopper.

It seems like the realm of source control has a lot of software designed to
help with this problem, so there might be other similar things out there.
But probably not anything designed to natively handle print-ready files.

-dre.


[1] http://diffuse.sourceforge.net/about.html

On Fri, Jan 23, 2015 at 7:26 AM, Judy Meirose <jmeir...@fcsl.edu> wrote:

Can anyone recommend a plagiarism checking software besides Turnitin and
SafeAssign?  I need to compare about 100 student assignments against each
other to make sure they don't copy each other's assignments.

Thanks.

Judy K. Meirose
Systems Librarian
Florida Coastal School of Law
8787 Baypine Rd
Jacksonville, FL
(904)680-7603


