Greetings,

I was adding datatype classes to galaxy.datatype.text.py for our custom JSON 
formats and I noticed what I think is a problem with the Json datatype class, 
in particular the _looks_like_json method.  The basic algorithm is (in pseudo 
code):

        if the_file_is_small_enough:
                try to load the the file as json
                return True on success, False otherwise
        else
                while True
                        find first non-blank line
                        if line starts with ‘{‘ or ‘[‘:
                                return True
                        else
                                return False

However,  JSON is typically compressed by stripping whitespace (including 
newlines), particularly when it is fetched from a web service. This means that 
the first call to readLine is going to load the entire file, which defeats the 
purpose of checking the file size in the first place.

I would submit a pull request with our fix, but since I am not a Python 
programmer I thought I would simply post the code here for review.

class Json(Text):
        # Unchanged bits of the class have been omitted.

        # Add a function that reads a single character skipping over whitespace.
        def read(self, fileHandle):
                ch = fileHandle.read(1)
                while ch.isspace():
                        ch = fileHandle.read(1)
                return ch

        # Only the “else-part” has changed
        def _looks_like_json(self, filename):
                # Pattern used by SequenceSplitLocations
                if os.path.getsize(filename) < 50000:
                        # If the file is small enough - don't guess just check.
                        try:
                                json.load(open(filename, "r"))
                                return True
                        except Exception:
                                return False
                else:
                        with open(filename, "r") as fh:
                                ch = self.read(fh)
                                return ch == '{' or ch == '['

We then use the read() method to sniff out our JSON formats. Once we can sniff 
a header subclasses only need to specify the header.

        class Lapps( Json ):
                header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org'''
                def sniff(self, filename)
                        with open(filename) as fh:
                                for c in header:
                                        if c != self.read(fh)
                                                return False
                        return True
        class Lif ( Lapps ):
                header = 
‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/jsonld#lif”'''
        class Gate( Lapps ):
                header = 
‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/gate”'''
        …

Regardless of how the JSON is rendered (pretty printed or not) we can match the 
“magic” strings used to identify our formats.

Being new to Python I don’t know how expensive it is to read a file one 
character at a time, but it has to be cheaper than potentially reading the 
entire file.

Cheers,
Keith

------------------------------
Research Associate
Department of Computer Science
Vassar College
Poughkeepsie, NY

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Reply via email to