Thanks for the detailed bug report Keith! I have implemented Peter
suggestion as a Pull Request
(https://github.com/galaxyproject/galaxy/pull/309).

-John

On Thu, Jun 4, 2015 at 3:34 AM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> Good idea - the Python file read method takes an option
> number of bytes/characters, so it could easily read just
> one, or say 100 and then strip whitespace.
>
> Peter
>
> On Thu, Jun 4, 2015 at 3:36 AM, Keith Suderman <suder...@cs.vassar.edu> wrote:
>> Greetings,
>>
>> I was adding datatype classes to galaxy.datatype.text.py for our custom JSON
>> formats and I noticed what I think is a problem with the Json datatype
>> class, in particular the _looks_like_json method.  The basic algorithm is
>> (in pseudo code):
>>
>> if the_file_is_small_enough:
>> try to load the the file as json
>> return True on success, False otherwise
>> else
>> while True
>> find first non-blank line
>> if line starts with ‘{‘ or ‘[‘:
>> return True
>> else
>> return False
>>
>> However,  JSON is typically compressed by stripping whitespace (including
>> newlines), particularly when it is fetched from a web service. This means
>> that the first call to readLine is going to load the entire file, which
>> defeats the purpose of checking the file size in the first place.
>>
>> I would submit a pull request with our fix, but since I am not a Python
>> programmer I thought I would simply post the code here for review.
>>
>> class Json(Text):
>> # Unchanged bits of the class have been omitted.
>>
>> # Add a function that reads a single character skipping over whitespace.
>> def read(self, fileHandle):
>> ch = fileHandle.read(1)
>> while ch.isspace():
>> ch = fileHandle.read(1)
>> return ch
>>
>> # Only the “else-part” has changed
>> def _looks_like_json(self, filename):
>> # Pattern used by SequenceSplitLocations
>> if os.path.getsize(filename) < 50000:
>> # If the file is small enough - don't guess just check.
>> try:
>> json.load(open(filename, "r"))
>> return True
>> except Exception:
>> return False
>> else:
>> with open(filename, "r") as fh:
>> ch = self.read(fh)
>> return ch == '{' or ch == '['
>>
>> We then use the read() method to sniff out our JSON formats. Once we can
>> sniff a header subclasses only need to specify the header.
>>
>> class Lapps( Json ):
>> header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org'''
>> def sniff(self, filename)
>> with open(filename) as fh:
>> for c in header:
>> if c != self.read(fh)
>> return False
>> return True
>> class Lif ( Lapps ):
>> header =
>> ‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/jsonld#lif”'''
>> class Gate( Lapps ):
>> header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/gate”'''
>> …
>>
>> Regardless of how the JSON is rendered (pretty printed or not) we can match
>> the “magic” strings used to identify our formats.
>>
>> Being new to Python I don’t know how expensive it is to read a file one
>> character at a time, but it has to be cheaper than potentially reading the
>> entire file.
>>
>> Cheers,
>> Keith
>>
>> ------------------------------
>> Research Associate
>> Department of Computer Science
>> Vassar College
>> Poughkeepsie, NY
>>
>>
>> ___________________________________________________________
>> Please keep all replies on the list by using "reply all"
>> in your mail client.  To manage your subscriptions to this
>> and other Galaxy lists, please use the interface at:
>>   https://lists.galaxyproject.org/
>>
>> To search Galaxy mailing lists use the unified search at:
>>   http://galaxyproject.org/search/mailinglists/
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Reply via email to