Greetings,
I was adding datatype classes to galaxy.datatype.text.py for our custom JSON
formats and I noticed what I think is a problem with the Json datatype class,
in particular the _looks_like_json method. The basic algorithm is (in pseudo
code):
if the_file_is_small_enough:
try to load the the file as json
return True on success, False otherwise
else
while True
find first non-blank line
if line starts with ‘{‘ or ‘[‘:
return True
else
return False
However, JSON is typically compressed by stripping whitespace (including
newlines), particularly when it is fetched from a web service. This means that
the first call to readLine is going to load the entire file, which defeats the
purpose of checking the file size in the first place.
I would submit a pull request with our fix, but since I am not a Python
programmer I thought I would simply post the code here for review.
class Json(Text):
# Unchanged bits of the class have been omitted.
# Add a function that reads a single character skipping over whitespace.
def read(self, fileHandle):
ch = fileHandle.read(1)
while ch.isspace():
ch = fileHandle.read(1)
return ch
# Only the “else-part” has changed
def _looks_like_json(self, filename):
# Pattern used by SequenceSplitLocations
if os.path.getsize(filename) < 50000:
# If the file is small enough - don't guess just check.
try:
json.load(open(filename, "r"))
return True
except Exception:
return False
else:
with open(filename, "r") as fh:
ch = self.read(fh)
return ch == '{' or ch == '['
We then use the read() method to sniff out our JSON formats. Once we can sniff
a header subclasses only need to specify the header.
class Lapps( Json ):
header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org'''
def sniff(self, filename)
with open(filename) as fh:
for c in header:
if c != self.read(fh)
return False
return True
class Lif ( Lapps ):
header =
‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/jsonld#lif”'''
class Gate( Lapps ):
header =
‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/gate”'''
…
Regardless of how the JSON is rendered (pretty printed or not) we can match the
“magic” strings used to identify our formats.
Being new to Python I don’t know how expensive it is to read a file one
character at a time, but it has to be cheaper than potentially reading the
entire file.
Cheers,
Keith
------------------------------
Research Associate
Department of Computer Science
Vassar College
Poughkeepsie, NY
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at:
http://galaxyproject.org/search/mailinglists/