Re: [GENERAL] json indexing and data types

2015-12-16 Thread Kaare Rasmussen

Hi Oleg

This is a known problem; that's why we stopped developing jsquery and are 
working on an SQL-level query language for jsonb, so you'll be able to use 
all the power and extensibility of SQL.  The idea is to use the power of 
subselects and unnest to unroll jsonb to the SQL level.

There is a presentation from pgconf.eu on this:
https://wiki.postgresql.org/images/4/4e/Createam.pdf, see slide #27


This is very interesting. Thanks for the update. And to all who answered 
this topic, sorry for going AWOL. I just got busy, but thanks for all the 
replies; I've got something to think about.



But I'm afraid it'll have to wait for 9.6.


I hope it does come in 9.6. I'll definitely look forward to that.

/kaare


Re: [GENERAL] json indexing and data types

2015-12-12 Thread Oleg Bartunov
On Wed, Dec 2, 2015 at 11:48 AM, Kaare Rasmussen  wrote:

> Hi
>
> As json essentially only has three basic data types, string, int, and
> boolean, I wonder how much of this - to index, search, and sort on
> unstructured data -  is possible. I guess part of the answer would be
> 'jsquery and vodka', but let me describe the problem first.
>
> The basics is, that I have a column with what is essentially json data; a
> number of data structures of different depths. Perhaps 10 - 30 top levels,
> and probably no more than 3, max 4 levels deep. In total there are some
> hundred thousands of rows in each table. It would probably be best stored
> as jsonb. Right now it's text, because it's only used by the application
> itself.
>
> It would be incredibly useful to add an index to this column, and to be
> able to search, using the index, on arbitrary elements. This part seems
> already there, with jsquery.
>
> The hard part is that some of the data items really have another type.
> There are dates and floating points, as the most important ones. And the
> really hard part is that sorting and range searches are important,
> especially for these two types. Having dates is iso-format, and
> left-padding floats with zeros is a low tech solution, and especially the
> latter is not very efficient.
>
> The solution might be to add functional indexes for these data items, but
> it's cumbersome and not easily maintainable. If a one-stop solution is in
> the works, or already there, it could save a lot of time.
>
>
This is a known problem; that's why we stopped developing jsquery and are
working on an SQL-level query language for jsonb, so you'll be able to use
all the power and extensibility of SQL.  The idea is to use the power of
subselects and unnest to unroll jsonb to the SQL level.
There is a presentation from pgconf.eu on this:
https://wiki.postgresql.org/images/4/4e/Createam.pdf, see slide #27
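The "unroll jsonb at the SQL level" idea can already be sketched with the set-returning jsonb functions available since 9.4 (the table and key names here are hypothetical):

```sql
-- Assume a table of documents: docs(id int, js jsonb).

-- Unroll an array inside each document into rows, then filter
-- and aggregate with ordinary SQL:
SELECT d.id, t.tag
FROM docs d,
     LATERAL jsonb_array_elements_text(d.js -> 'tags') AS t(tag)
WHERE t.tag LIKE 'post%';

-- Unroll top-level key/value pairs with jsonb_each:
SELECT d.id, kv.key, kv.value
FROM docs d,
     LATERAL jsonb_each(d.js) AS kv(key, value);
```

Once the document is flattened into rows like this, sorting, range predicates, and joins are plain SQL.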

But I'm afraid it'll come to 9.6.




> /kaare
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] json indexing and data types

2015-12-03 Thread Merlin Moncure
On Wed, Dec 2, 2015 at 10:04 PM, Tom Lane  wrote:
> (Having said that, it sure looks to me like JSON's idea of a number is
> float/numeric, not merely int.  Are you sure you need more capability
> in that department, and if so what exactly?)

Numeric range searching is good, but the numeric case isn't compelling
for me.  Text string searching *is* compelling but is not yet
supported by jsquery.  Arbitrary token searching would be ideal, but
I'd settle for left-to-right matching.
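For what it's worth, left-to-right (prefix) matching on an extracted text field can be served outside jsquery today; a sketch with hypothetical table and key names:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Prefix ("left-to-right") matching via a btree on the extracted text:
CREATE INDEX docs_name_prefix_idx
    ON docs ((js ->> 'name') text_pattern_ops);
-- usable by: WHERE js ->> 'name' LIKE 'abc%'

-- Arbitrary-substring matching via a trigram GIN index:
CREATE INDEX docs_name_trgm_idx
    ON docs USING gin ((js ->> 'name') gin_trgm_ops);
-- usable by: WHERE js ->> 'name' LIKE '%abc%'
```

The trigram index is the one subject to the worst-case behavior described below; the btree prefix index does not have that problem.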

Historically, GIN index searching over jsquery with ranges or
pg_trgm has also given me heartburn with its worst-case performance
behavior, in that it could in depressingly common cases underperform
(sometimes grossly) a brute-force scan.  This doesn't mesh well with
the 'user-supplied list of search terms' type of searching that we do
a lot of.

The situation of late for pg_trgm has gotten drastically better with
the triconsistent API optimizations.  I haven't gotten around yet to
seeing if any of that magic has been sprinkled on jsquery.

merlin




Re: [GENERAL] json indexing and data types

2015-12-03 Thread Merlin Moncure
On Wed, Dec 2, 2015 at 7:11 PM, Jim Nasby  wrote:
> On 12/2/15 7:06 PM, Merlin Moncure wrote:
>>
>>  > The basics is, that I have a column with what is essentially json data; a
>>  > number of data structures of different depths. Perhaps 10 - 30 top levels,
>>  > and probably no more than 3, max 4 levels deep. In total there are some
>>  > hundred thousands of rows in each table. It would probably be best stored as
>>  > jsonb. Right now it's text, because it's only used by the application
>>  > itself.
>
>
> After re-reading this part...
>
> Are you in control of the JSON itself, and are the number of permutations
> known in advance? It might be that something like table inheritance is a
> better solution...

Yeah, or other data storage strategies that are btree friendly but not
'schema rigid', like EAV (perhaps a modified variant to support
storing the document structure).  There are definitely tradeoffs
involved but you have to consider all the options.
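A minimal sketch of such a modified EAV variant, with one typed value column per type so that each keeps its native btree ordering (all names hypothetical):

```sql
-- One row per document attribute; only the column matching the
-- attribute's declared type is filled in.
CREATE TABLE doc_attr (
    doc_id    int  NOT NULL,
    path      text NOT NULL,        -- encodes document structure,
                                    -- e.g. 'order.shipped.date'
    val_text  text,
    val_num   numeric,
    val_ts    timestamptz,
    PRIMARY KEY (doc_id, path)
);

CREATE INDEX doc_attr_num_idx ON doc_attr (path, val_num);
CREATE INDEX doc_attr_ts_idx  ON doc_attr (path, val_ts);

-- Range search on a typed attribute, fully btree-driven:
-- SELECT doc_id FROM doc_attr
--  WHERE path = 'order.total' AND val_num BETWEEN 10 AND 20;
```

The obvious tradeoff is that the application must decompose each document into attribute rows on write, and reassemble on read.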

merlin




Re: [GENERAL] json indexing and data types

2015-12-03 Thread Kaare Rasmussen

On 2015-12-03 05:04, Tom Lane wrote:
Yeah. The problem here is that a significant part of the argument for 
the JSON/JSONB datatypes was that they adhere to standards (RFC 7159 
in particular). I can't see us accepting a patch that changes them 
into JSON-plus-some-PG-enhancements.


Would be nice for my specific need, but probably wouldn't do a lot of 
good in the long run.


(Having said that, it sure looks to me like JSON's idea of a number is 
float/numeric, not merely int. Are you sure you need more capability 
in that department, and if so what exactly?) 


Hmm, I think you're right, having just tried some conversions. But 
still, I would have to rely on vodka (or similar) knowing this, I guess?


/kaare




Re: [GENERAL] json indexing and data types

2015-12-03 Thread Jim Nasby

On 12/2/15 10:38 PM, Kaare Rasmussen wrote:

On 2015-12-03 01:04, Jim Nasby wrote:

We have a client that has a similar (though also a bit different)
need. Specifically, they get an XML document that has element
attributes that tell you what data type the element should contain. We
convert the XML to JSON (easy thanks to plpython), which produces a
bunch of nested JSON objects (typed as specifically as possible in
JSON). The XML attributes get turned into items in an object. So


OK, AFAIUI, you added the schema to each row. I think that I have fewer 
variations, so perhaps the information would live better outside, but 
that's a detail. Turning them into tables and views is a good way to 
represent the indexable data. Functionally, it seems to me to be almost 
the same as functional indexing, but much more transparent, and easier 
to write a query for.


I didn't add the schema; in this case the schema was always the same. If 
you had a limited number of schemas you could indicate which one was in 
a particular document and use the appropriate decoding.



Are you in control of the JSON itself, and are the number of
permutations known in advance? It might be that something like table
inheritance is a better solution...


Yes, I can alter the db specification. Not sure how table inheritance
would help, though?


They provide a means by which you can refer to the common parts of 
disparate schemas in one place, while still being able to deal with the 
inner details in each child table.


It might not be useful depending on what your goals are. I mentioned it 
because I think most people only think of inheritance as "That weird 
thing that partitioning uses."

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [GENERAL] json indexing and data types

2015-12-03 Thread Jeff Janes
On Wed, Dec 2, 2015 at 8:04 PM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> On Wed, Dec 2, 2015 at 12:03 AM, Kaare Rasmussen  wrote:
>>> As json essentially only has three basic data types, string, int, and
>>> boolean, I wonder how much of this - to index, search, and sort on
>>> unstructured data -  is possible.
>
>> I feel your pain.  jsquery is superb for subdocument searching on
>> *specific* subdocuments but range searching is really limited.
>
> Yeah.  The problem here is that a significant part of the argument for
> the JSON/JSONB datatypes was that they adhere to standards (RFC 7159 in
> particular).  I can't see us accepting a patch that changes them into
> JSON-plus-some-PG-enhancements.
>
> For cases where you know that specific sub-fields can be expected to be
> of particular datatypes, I think you could get a lot of mileage out of
> functional indexes ... but you'd have to write your queries to match the
> indexes, which could be painful.

If you create a view which has columns defined according to the index
expression, it does remove a lot of the pain of making queries that
use those expressions.  It looks just like using a real column, as
long as you don't update it.
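A sketch of that pattern (hypothetical names): the view's column definition repeats the index expression verbatim, so queries against the view can use the index.

```sql
-- Expression index on a jsonb sub-field, cast to its real type
-- (text-to-numeric casts are IMMUTABLE, so this is indexable):
CREATE INDEX docs_price_idx
    ON docs (((js ->> 'price')::numeric));

-- A view exposing the same expression as a plain column:
CREATE VIEW docs_v AS
SELECT id,
       (js ->> 'price')::numeric AS price
FROM docs;

-- This can use docs_price_idx, since the planner expands the
-- view to the exact indexed expression:
-- SELECT * FROM docs_v WHERE price BETWEEN 10 AND 20;
```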

Cheers,

Jeff




Re: [GENERAL] json indexing and data types

2015-12-02 Thread Jim Nasby

On 12/2/15 7:06 PM, Merlin Moncure wrote:

 > The basics is, that I have a column with what is essentially json data; a
 > number of data structures of different depths. Perhaps 10 - 30 top levels,
 > and probably no more than 3, max 4 levels deep. In total there are some
 > hundred thousands of rows in each table. It would probably be best stored as
 > jsonb. Right now it's text, because it's only used by the application
 > itself.


After re-reading this part...

Are you in control of the JSON itself, and are the number of 
permutations known in advance? It might be that something like table 
inheritance is a better solution...





[GENERAL] json indexing and data types

2015-12-02 Thread Merlin Moncure
On Wed, Dec 2, 2015 at 12:03 AM, Kaare Rasmussen  wrote:
> Hi
>
> As json essentially only has three basic data types, string, int, and
> boolean, I wonder how much of this - to index, search, and sort on
> unstructured data -  is possible. I guess part of the answer would be
> 'jsquery and vodka', but let me describe the problem first.
>
> The basics is, that I have a column with what is essentially json data; a
> number of data structures of different depths. Perhaps 10 - 30 top levels,
> and probably no more than 3, max 4 levels deep. In total there are some
> hundred thousands of rows in each table. It would probably be best stored as
> jsonb. Right now it's text, because it's only used by the application
> itself.
>
> It would be incredibly useful to add an index to this column, and to be able
> to search, using the index, on arbitrary elements. This part seems already
> there, with jsquery.
>
> The hard part is that some of the data items really have another type. There
> are dates and floating points, as the most important ones. And the really
> hard part is that sorting and range searches are important, especially for
> these two types. Having dates is iso-format, and left-padding floats with
> zeros is a low tech solution, and especially the latter is not very
> efficient.
>
> The solution might be to add functional indexes for these data items, but
> it's cumbersome and not easily maintainable. If a one-stop solution is in
> the works, or already there, it could save a lot of time.

I feel your pain.  jsquery is superb for subdocument searching on
*specific* subdocuments but range searching is really limited.  Value
searching is there for numerics but dates and text range searching are not
present.  We also have to understand that you are asking the index to make
assumptions about the json that are not clear from the structure itself
(such as subfield 'x' is a date).

The only workaround I've been able to come up with is to migrate the json
to a specially encoded text field, stored side by side with the source
json, that is more amenable to pg_trgm based searching (to give you a taste
of that complexity, keys are stored upper case and values are stored lower
case).

Some might say that you're better off using a dedicated json searching
server like solr but these systems aren't magic; they will quickly boil
down to a brute force search in the face of complex queries, and they have
lots of other problems in my experience (starting with, lack of proper
transactions and painfully slow insertion of large documents).  Other
people recommend them; I don't.

One way of looking at this problem is that the "schemaless" check is
getting cashed.  If you need detailed data-driven queries (as opposed to
more 'text searchy' type searches), perhaps it's time to start running your
data through a normalized structure.

merlin


Re: [GENERAL] json indexing and data types

2015-12-02 Thread Kaare Rasmussen

On 2015-12-03 01:04, Jim Nasby wrote:
We have a client that has a similar (though also a bit different) 
need. Specifically, they get an XML document that has element 
attributes that tell you what data type the element should contain. We 
convert the XML to JSON (easy thanks to plpython), which produces a 
bunch of nested JSON objects (typed as specifically as possible in 
JSON). The XML attributes get turned into items in an object. So


OK, AFAIUI, you added the schema to each row. I think that I have fewer 
variations, so perhaps the information would live better outside, but 
that's a detail. Turning them into tables and views is a good way to 
represent the indexable data. Functionally, it seems to me to be almost 
the same as functional indexing, but much more transparent, and easier 
to write a query for.


Are you in control of the JSON itself, and are the number of 
permutations known in advance? It might be that something like table 
inheritance is a better solution...


Yes, I can alter the db specification. Not sure how table inheritance 
would help, though?


/kaare




Re: [GENERAL] json indexing and data types

2015-12-02 Thread Kaare Rasmussen

On 2015-12-03 02:06, Merlin Moncure wrote:
I feel your pain.  jsquery is superb for subdocument searching on 
*specific* subdocuments but range searching is really limited.  Value 
searching is there for numerics but dates and text range searching are 
not present.  We also have to understand that you are asking the index 
to make assumptions about the json that are not clear from the 
structure itself (such as subfield 'x' is a date).


I thought that text range searching (using indexes) was available in pg, 
at least with vodka?
Some might say that you're better off using a dedicated json searching 
server like solr but these systems aren't magic; they will quickly 
boil down to a brute force search in the face of complex queries, and 
they have lots of other problems in my experience (starting with, lack 
of proper transactions and painfully slow insertion of large 
documents).  Other people recommend them; I don't.


They come with their own set of problems, including not being able to be 
part of a WHERE clause. The json data may not be the only thing you want 
to limit your selection with.


/kaare





Re: [GENERAL] json indexing and data types

2015-12-02 Thread Tom Lane
Merlin Moncure  writes:
> On Wed, Dec 2, 2015 at 12:03 AM, Kaare Rasmussen  wrote:
>> As json essentially only has three basic data types, string, int, and
>> boolean, I wonder how much of this - to index, search, and sort on
>> unstructured data -  is possible.

> I feel your pain.  jsquery is superb for subdocument searching on
> *specific* subdocuments but range searching is really limited.

Yeah.  The problem here is that a significant part of the argument for
the JSON/JSONB datatypes was that they adhere to standards (RFC 7159 in
particular).  I can't see us accepting a patch that changes them into
JSON-plus-some-PG-enhancements.

For cases where you know that specific sub-fields can be expected to be
of particular datatypes, I think you could get a lot of mileage out of
functional indexes ... but you'd have to write your queries to match the
indexes, which could be painful.
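For example (hypothetical table and key names): a date sub-field needs one extra wrinkle, because the text-to-date cast is only STABLE (it depends on DateStyle), so it has to be wrapped in an IMMUTABLE function before it can be indexed.

```sql
-- IMMUTABLE wrapper around the extraction + cast:
CREATE FUNCTION jsonb_date(j jsonb, k text) RETURNS date
    IMMUTABLE LANGUAGE sql
    AS $$ SELECT ($1 ->> $2)::date $$;

CREATE INDEX docs_when_idx ON docs (jsonb_date(js, 'when'));

-- The query must repeat the indexed expression exactly,
-- which is the "painful" part:
SELECT *
FROM docs
WHERE jsonb_date(js, 'when')
      BETWEEN '2015-01-01' AND '2015-06-30';
```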

(Having said that, it sure looks to me like JSON's idea of a number is
float/numeric, not merely int.  Are you sure you need more capability
in that department, and if so what exactly?)

regards, tom lane




Re: [GENERAL] json indexing and data types

2015-12-02 Thread Jim Nasby

On 12/2/15 12:03 AM, Kaare Rasmussen wrote:


The hard part is that some of the data items really have another type.
There are dates and floating points, as the most important ones. And the
really hard part is that sorting and range searches are important,
especially for these two types. Keeping dates in ISO format and
left-padding floats with zeros are low-tech solutions, and especially
the latter is not very efficient.

The solution might be to add functional indexes for these data items,
but it's cumbersome and not easily maintainable. If a one-stop solution
is in the works, or already there, it could save a lot of time.


We have a client that has a similar (though also a bit different) need. 
Specifically, they get an XML document that has element attributes that 
tell you what data type the element should contain. We convert the XML 
to JSON (easy thanks to plpython), which produces a bunch of nested JSON 
objects (typed as specifically as possible in JSON). The XML attributes 
get turned into items in an object. So


<some-element type="integer">42</some-element>

becomes something like

"some-element": { "@type": "integer", "#text": 42 }

Some transforms are applied to that (like replacing - with _), and the 
resulting JSON is used to create a set of tables, where each table 
contains one level of nesting (triggers to handle inserts are also 
created). Finally, views that break out each element value are created 
on top of these tables. If specific type info is available that's used 
to determine the type of the column, otherwise an appropriate type is 
chosen based on json_typeof(). This results in a view that looks 
something like


SELECT ((json_data-> 'some_element')->>'#text')::integer AS some_element

The reason we went with one table per level was to allow full indexing 
(without needing Vodka), because it made the code easier to develop (at 
least during initial conception), and because it supports joining 
between the views nicely (something we needed). You could probably do 
this without splitting into multiple tables, but I suspect it would add 
significant complexity to the view creation.


I'd like to eventually open source the guts of this, but unfortunately 
there's a lot of work required to get it to the point where that would 
be possible. There's also some choices that were made that in retrospect 
should probably be done differently. Of course if someone wanted to pay 
us to do that then we'll find the time ;). Short of that if someone is 
really serious about helping with that effort I can start untangling 
parts of this from the proprietary codebase that it's currently buried 
in, but even that would be a pretty significant effort.





[GENERAL] json indexing and data types

2015-12-01 Thread Kaare Rasmussen

Hi

As json essentially only has three basic data types, string, int, and 
boolean, I wonder how much of this - to index, search, and sort on 
unstructured data -  is possible. I guess part of the answer would be 
'jsquery and vodka', but let me describe the problem first.


The basic setup is that I have a column with what is essentially JSON 
data: a number of data structures of different depths. Perhaps 10 - 30 top 
levels, and probably no more than 3, max 4, levels deep. In total there 
are some hundreds of thousands of rows in each table. It would probably be 
best stored as jsonb. Right now it's text, because it's only used by the 
application itself.


It would be incredibly useful to add an index to this column, and to be 
able to search, using the index, on arbitrary elements. This part seems 
already there, with jsquery.


The hard part is that some of the data items really have another type; 
dates and floating points are the most important ones. And the really 
hard part is that sorting and range searches are important, especially 
for these two types. Keeping dates in ISO format and left-padding floats 
with zeros are low-tech solutions, and especially the latter is not very 
efficient.
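To spell out why these encodings work (or don't): text comparison is lexicographic, so ISO dates sort correctly as-is, while numbers only sort correctly as text once padded to a fixed width. A sketch:

```sql
-- ISO-8601 dates sort correctly as plain text:
--   '2015-01-31' < '2015-12-01'

-- Unpadded floats do not: '10.5' sorts before '9.5' as text.
-- Fixed-width zero-padding restores the numeric ordering:
SELECT to_char(9.5,  '00000000.0000') AS a,
       to_char(10.5, '00000000.0000') AS b;
-- a now sorts before b under text comparison
```

The padding wastes space and breaks down for negative numbers and exponents, which is part of why it's not very efficient.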


The solution might be to add functional indexes for these data items, 
but it's cumbersome and not easily maintainable. If a one-stop solution 
is in the works, or already there, it could save a lot of time.


/kaare

