Re: [sqlite] Memoization in sqlite json1 functions

2017-03-25 Thread Deon Brewis
> Page faults aren’t necessarily due to swapping. If you read a memory-mapped 
> file, the first access to any page will fault to disk. Since SQLite supports 
> memory-mapping, I’d assume this can occur during a query — the column data 
> returned by sqlite may point into mmap’ed pages. (Is that correct?)

> In that situation, if a blob value is significantly larger than one page, the 
> difference between scanning everything up to byte offset x, vs. reading a few 
> bytes at the beginning and then jumping directly to x, can be significant.

I don't think SQLite will return a pointer directly into an mmap'ed page, and it 
definitely can't in the case of a column spilled into an overflow page - the 
data will be discontiguous. 

Even the blob APIs don't help: SQLite overflow pages link to each other at the 
beginning of each page, so you can't e.g. load the 5th page without also loading 
pages 1 through 4. And since SQLite pages are generally smaller than disk pages, 
that means faulting in every disk page along the way.

Anyway, by faulting I meant 'paging back out due to memory pressure'. Sorry, I 
should have been clearer - I knew that one was going to bite me as soon as I 
hit send... 

How do you take a thread offline here?

- Deon

-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Jens Alfke
Sent: Saturday, March 25, 2017 10:35 AM
To: SQLite mailing list <sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] Memoization in sqlite json1 functions


> On Mar 24, 2017, at 4:48 PM, Deon Brewis <de...@outlook.com> wrote:
> 
> Yeah ok, if you take I/O hits then things like memory pre-fetching make zero 
> difference. We're more in the business of "You take a page fault" == "You buy 
> more memory". Different level of performance requirements. (And glad that 
> SQLite works well for both of us.)

Page faults aren’t necessarily due to swapping. If you read a memory-mapped 
file, the first access to any page will fault to disk. Since SQLite supports 
memory-mapping, I’d assume this can occur during a query — the column data 
returned by sqlite may point into mmap’ed pages. (Is that correct?)

In that situation, if a blob value is significantly larger than one page, the 
difference between scanning everything up to byte offset x, vs. reading a few 
bytes at the beginning and then jumping directly to x, can be significant.

>> In Fleece I put a lot of effort into making the C++ API nice to use, so that 
>> I don’t have to have any other data structure. That's worked well so far.
> 
> Strong typing?

That only happens at higher levels in our stack. At the level I mostly work on, 
this is a document-oriented, schemaless data store. Up above there’s the option 
to use a data-modeling layer that binds document properties to platform object 
properties with strong typing (though we don’t support C++.) But even those 
bindings can be considerably more efficient than the typical JSON object tree — 
you’re not allocating string objects for dictionary keys, nor number objects 
for numeric/boolean values. It’s also generally faster to populate instance 
variables of objects (usually at fixed offsets) vs. adding key/value pairs to a 
hash table.

[I think we’re getting off topic, but I’ll be happy to continue offline.]

—Jens
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Memoization in sqlite json1 functions

2017-03-25 Thread Jens Alfke

> On Mar 24, 2017, at 4:48 PM, Deon Brewis  wrote:
> 
> Yeah ok, if you take I/O hits then things like memory pre-fetching make zero 
> difference. We're more in the business of "You take a page fault" == "You buy 
> more memory". Different level of performance requirements. (And glad that 
> SQLite works well for both of us.)

Page faults aren’t necessarily due to swapping. If you read a memory-mapped 
file, the first access to any page will fault to disk. Since SQLite supports 
memory-mapping, I’d assume this can occur during a query — the column data 
returned by sqlite may point into mmap’ed pages. (Is that correct?)

In that situation, if a blob value is significantly larger than one page, the 
difference between scanning everything up to byte offset x, vs. reading a few 
bytes at the beginning and then jumping directly to x, can be significant.

>> In Fleece I put a lot of effort into making the C++ API nice to use, so that 
>> I don’t have to have any other data structure. That's worked well so far.
> 
> Strong typing?

That only happens at higher levels in our stack. At the level I mostly work on, 
this is a document-oriented, schemaless data store. Up above there’s the option 
to use a data-modeling layer that binds document properties to platform object 
properties with strong typing (though we don’t support C++.) But even those 
bindings can be considerably more efficient than the typical JSON object tree — 
you’re not allocating string objects for dictionary keys, nor number objects 
for numeric/boolean values. It’s also generally faster to populate instance 
variables of objects (usually at fixed offsets) vs. adding key/value pairs to a 
hash table.

[I think we’re getting off topic, but I’ll be happy to continue offline.]

—Jens


Re: [sqlite] Memoization in sqlite json1 functions

2017-03-24 Thread Deon Brewis
> It could be; my knowledge of optimization gets tenuous when it comes to 
> down-to-the-metal areas like CPU caching. But for large data, you run the 
> risk of blowing out the cache traversing it. And if the data is 
> memory-mapped, it becomes hugely faster to skip right to the relevant page 
> instead of faulting in every page ahead of it.

Yeah ok, if you take I/O hits then things like memory pre-fetching make zero 
difference. We're more in the business of "You take a page fault" == "You buy 
more memory". Different level of performance requirements. (And glad that 
SQLite works well for both of us.)


> In Fleece I put a lot of effort into making the C++ API nice to use, so that 
> I don’t have to have any other data structure. That's worked well so far.

Strong typing?


- Deon

-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Jens Alfke
Sent: Thursday, March 23, 2017 6:09 PM
To: SQLite mailing list <sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] Memoization in sqlite json1 functions


> On Mar 23, 2017, at 3:17 PM, Deon Brewis <de...@outlook.com> wrote:
> 
> If you however can use a forward-only push or pull parser like a SAX or StAX 
> parser, it's a different story. I'm using a StAX-like pull parser for a binary 
> json-ish internal format we have, and reading & parsing through it is on par 
> with the performance of reading the equivalent SQLite columns directly

I agree that’s a lot faster, but you’re still looking at O(n) lookup time in an 
array or dictionary. And the proportionality constant gets worse as the document 
grows, since jumping to the next item involves parsing through all of the 
nested items in that collection.

> That obviously implies that if you do random access into a structure you have 
> to keep reparsing it (which is where memoization would be nice). However, CPU 
> caches are better at reading continuous data streams in forward-only fashion 
> than they are at chasing pointers, so forward-only pull parsers, even when you 
> have to repeat the entire parse, are often faster than the math suggests.

It could be; my knowledge of optimization gets tenuous when it comes to 
down-to-the-metal areas like CPU caching. But for large data, you run the risk 
of blowing out the cache traversing it. And if the data is memory-mapped, it 
becomes hugely faster to skip right to the relevant page instead of faulting in 
every page ahead of it.

> Besides, in 99% of cases my users take the outcome from a json parse and just 
> store the results into a C++ data structure anyway. In that case the 
> intermediary object tree is just a throwaway and you may as well have built 
> the C++ structure up using a pull or push parser.

In Fleece I put a lot of effort into making the C++ API nice to use, so that I 
don’t have to have any other data structure. That's worked well so far.

—Jens



Re: [sqlite] Memoization in sqlite json1 functions

2017-03-23 Thread Jens Alfke

> On Mar 23, 2017, at 3:17 PM, Deon Brewis  wrote:
> 
> If you however can use a forward-only push or pull parser like a SAX or StAX 
> parser, it's a different story. I'm using a StAX-like pull parser for a binary 
> json-ish internal format we have, and reading & parsing through it is on par 
> with the performance of reading the equivalent SQLite columns directly

I agree that’s a lot faster, but you’re still looking at O(n) lookup time in an 
array or dictionary. And the proportionality constant gets worse as the document 
grows, since jumping to the next item involves parsing through all of the 
nested items in that collection.

> That obviously implies that if you do random access into a structure you have 
> to keep reparsing it (which is where memoization would be nice). However, CPU 
> caches are better at reading continuous data streams in forward-only fashion 
> than they are at chasing pointers, so forward-only pull parsers, even when you 
> have to repeat the entire parse, are often faster than the math suggests.

It could be; my knowledge of optimization gets tenuous when it comes to 
down-to-the-metal areas like CPU caching. But for large data, you run the risk 
of blowing out the cache traversing it. And if the data is memory-mapped, it 
becomes hugely faster to skip right to the relevant page instead of faulting in 
every page ahead of it.

> Besides, in 99% of cases my users take the outcome from a json parse and just 
> store the results into a C++ data structure anyway. In that case the 
> intermediary object tree is just a throwaway and you may as well have built 
> the C++ structure up using a pull or push parser.

In Fleece I put a lot of effort into making the C++ API nice to use, so that I 
don’t have to have any other data structure. That's worked well so far.

—Jens



Re: [sqlite] Memoization in sqlite json1 functions

2017-03-23 Thread Deon Brewis
It has more to do with how you parse JSON - if you want to build it into an 
object tree then, sure, you're obviously dead in the water with any kind of 
JSON.

If you however can use a forward-only push or pull parser like a SAX or StAX 
parser, it's a different story. I'm using a StAX-like pull parser for a binary 
json-ish internal format we have, and reading & parsing through it is on par 
with the performance of reading the equivalent SQLite columns directly (apart 
from the expression-index covered-value lookup scenario... grumble... grumble...).

So binary-json like formats can perform well - you just can't use an object 
tree to parse them.

That obviously implies that if you do random access into a structure you have to 
keep reparsing it (which is where memoization would be nice). However, CPU 
caches are better at reading continuous data streams in forward-only fashion 
than they are at chasing pointers, so forward-only pull parsers, even when you 
have to repeat the entire parse, are often faster than the math suggests. 
(In the same way that scanning an unsorted vector in O(n) is often faster than 
an O(log n) search through a map.)
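For readers unfamiliar with the pull-parser style, here is a minimal sketch of the shape of the idea over a toy length-prefixed binary format. This is not the internal format discussed above - the tags and layout are invented for illustration. The parser only ever moves forward; "random access" means running it again from the start.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One event pulled from the stream: a tag byte, then a length-prefixed payload.
struct Event {
    char tag;                 // 'S' = string in this toy format
    const uint8_t* payload;   // points into the original buffer; no copy made
    uint8_t len;
};

// Forward-only pull parser over records of the form [tag][len][payload...].
class PullParser {
public:
    PullParser(const uint8_t* data, size_t size) : cur_(data), end_(data + size) {}

    // Advance to the next record. Returns false at end of stream.
    // There is deliberately no way to seek backwards.
    bool next(Event& ev) {
        if (end_ - cur_ < 2) return false;
        ev.tag = static_cast<char>(cur_[0]);
        ev.len = cur_[1];
        ev.payload = cur_ + 2;
        if (static_cast<size_t>(end_ - ev.payload) < ev.len) return false;
        cur_ = ev.payload + ev.len;
        return true;
    }

private:
    const uint8_t* cur_;
    const uint8_t* end_;
};

// Single forward scan: return the record that follows a matching key record.
std::string extract(const std::vector<uint8_t>& doc, const std::string& key) {
    PullParser p(doc.data(), doc.size());
    Event ev;
    bool keyMatched = false;
    while (p.next(ev)) {
        std::string s(reinterpret_cast<const char*>(ev.payload), ev.len);
        if (keyMatched) return s;   // this record is the matched key's value
        keyMatched = (ev.tag == 'S' && s == key);
    }
    return "";
}
```

Each extract() re-parses from the start - exactly the cost the memoization discussion is about - but every byte is touched in order, which is what keeps the CPU's prefetcher happy.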

Besides, in 99% of cases my users take the outcome of a json parse and just 
store the results in a C++ data structure anyway. In that case the 
intermediary object tree is just a throwaway and you may as well have built the 
C++ structure up using a pull or push parser. It's very little extra work, and 
it's way... way... faster. 

- Deon

-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Jens Alfke
Sent: Thursday, March 23, 2017 11:05 AM
To: SQLite mailing list <sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] Memoization in sqlite json1 functions


> On Mar 23, 2017, at 4:30 AM, Richard Hipp <d...@sqlite.org> wrote:
> 
> BLOBs are reserved for a future enhancement in which BLOBs will store the 
> binary encoding for JSON.

I’ve been down this road. There are a number of JSON-compatible binary encoding 
formats, but most of them don’t save much time, because (a) they’re not _that_ 
much faster to parse, (b) reading JSON tends to be dominated by allocating an 
object tree, not by the actual parsing, and (c) usually you have to parse the 
entire data even if you only want to use one piece of it [as in a query].

I ended up designing and implementing a new format, called Fleece*. Its 
advantage is that it doesn’t require parsing or even memory allocation. The 
internal structure is already an object tree, except that it uses compressed 
relative offsets instead of pointers. This means that internal pointers into 
Fleece data can be used directly as the data objects.

In my current project** we’re storing Fleece in SQLite instead of JSON, with a 
modified version of the json1 extension to make it accessible in queries. It 
works very well. The Fleece equivalent of json_extract( ) just does some 
pointer manipulation to find the root object in the blob, then further lookups 
to jump to each nested object in the path.

—Jens

* https://github.com/couchbaselabs/fleece 
<https://github.com/couchbaselabs/fleece>
** https://github.com/couchbase/couchbase-lite-core


Re: [sqlite] Memoization in sqlite json1 functions

2017-03-23 Thread Domingo Alvarez Duarte

Hello Jens !

Nice to know this project, I'll look at it.

Cheers !


On 23/03/17 15:05, Jens Alfke wrote:

On Mar 23, 2017, at 4:30 AM, Richard Hipp  wrote:

BLOBs are reserved for a future enhancement in which BLOBs will store the 
binary encoding for JSON.

I’ve been down this road. There are a number of JSON-compatible binary encoding 
formats, but most of them don’t save much time, because (a) they’re not _that_ 
much faster to parse, (b) reading JSON tends to be dominated by allocating an 
object tree, not by the actual parsing, and (c) usually you have to parse the 
entire data even if you only want to use one piece of it [as in a query].

I ended up designing and implementing a new format, called Fleece*. Its 
advantage is that it doesn’t require parsing or even memory allocation. The 
internal structure is already an object tree, except that it uses compressed 
relative offsets instead of pointers. This means that internal pointers into 
Fleece data can be used directly as the data objects.

In my current project** we’re storing Fleece in SQLite instead of JSON, with a 
modified version of the json1 extension to make it accessible in queries. It 
works very well. The Fleece equivalent of json_extract( ) just does some 
pointer manipulation to find the root object in the blob, then further lookups 
to jump to each nested object in the path.

—Jens

* https://github.com/couchbaselabs/fleece 

** https://github.com/couchbase/couchbase-lite-core


Re: [sqlite] Memoization in sqlite json1 functions

2017-03-23 Thread Jens Alfke

> On Mar 23, 2017, at 4:30 AM, Richard Hipp  wrote:
> 
> BLOBs are reserved for a future enhancement in which BLOBs will store the 
> binary encoding for JSON.

I’ve been down this road. There are a number of JSON-compatible binary encoding 
formats, but most of them don’t save much time, because (a) they’re not _that_ 
much faster to parse, (b) reading JSON tends to be dominated by allocating an 
object tree, not by the actual parsing, and (c) usually you have to parse the 
entire data even if you only want to use one piece of it [as in a query].

I ended up designing and implementing a new format, called Fleece*. Its 
advantage is that it doesn’t require parsing or even memory allocation. The 
internal structure is already an object tree, except that it uses compressed 
relative offsets instead of pointers. This means that internal pointers into 
Fleece data can be used directly as the data objects.

In my current project** we’re storing Fleece in SQLite instead of JSON, with a 
modified version of the json1 extension to make it accessible in queries. It 
works very well. The Fleece equivalent of json_extract( ) just does some 
pointer manipulation to find the root object in the blob, then further lookups 
to jump to each nested object in the path.
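The "offsets instead of pointers" trick can be sketched in a few lines. This is the general technique, not Fleece's actual encoding (which is tagged and more compact); the point is that because references are relative, a mapped or copied buffer needs no pointer fix-up pass, and a pointer into the raw data serves directly as the value object.

```cpp
#include <cstdint>
#include <cstring>

// A reference field stores the distance in bytes *back* from the field itself
// to the value it refers to. Resolving it is pure pointer arithmetic; nothing
// is parsed and nothing is allocated.
inline const uint8_t* deref(const uint8_t* field) {
    uint32_t back;
    std::memcpy(&back, field, sizeof back);   // unaligned-safe read
    return field - back;
}

// Read a 4-byte integer value at a resolved position.
inline int32_t read_int32(const uint8_t* value) {
    int32_t v;
    std::memcpy(&v, value, sizeof v);
    return v;
}
```

Since offsets are invariant under relocation, the same bytes work whether the document lives in a malloc'ed block, an mmap'ed file, or a SQLite blob column.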

—Jens

* https://github.com/couchbaselabs/fleece 

** https://github.com/couchbase/couchbase-lite-core


Re: [sqlite] Memoization in sqlite json1 functions

2017-03-23 Thread Scott Hess
What is the goal, though?  Your app knows your data and performance needs,
so if you find yourself running the same query to read off the same result
set over and over, change your app to do the right thing.

If it's somehow more convenient to have SQLite do it, populate a temporary
table and pull the data from that, which doesn't require any new API at all.

-scott


On Thu, Mar 23, 2017 at 6:23 AM, Domingo Alvarez Duarte 
wrote:

> Hello Richard !
>
> Yes, I do see "because the implementation parses the JSON anew each time",
> and this is a recurring pattern in some sqlite functions. It would be nice if
> we could have a "session/query/row" storage space for query information that
> can be reused - for example, the json functions could reuse an already-parsed
> json field several times; another example: we could have session/query/row
> variables.
>
> We could have something like "sqlite3_set_auxdata" but with granularity
> for row/query/session and as a bonus would be nice to have session
> variables like https://dev.mysql.com/doc/refman/5.7/en/user-variables.html
> .
>
> Cheers !
>
> On 23/03/17 08:30, Richard Hipp wrote:
>
>> On 3/22/17, Domingo Alvarez Duarte  wrote:
>>
>>> Hello Richard !
>>>
>>> I noticed that sqlite does not use any memoization in json1 functions.
>>>
>>> For example jsonExtractFunc and others parse the json string every time
>>> they're called, even when the json string is the same.
>>>
>>> minimal example : "select json_extract(json, '$.name') name,
>>> json_extract(json, '$.address') name from some_table;"
>>>
>>> Could it be possible to have some kind of memoization as a general option
>>> for any sqlite function?
>>>
>> In paragraph 3.0 of https://www.sqlite.org/json1.html:  "All json1
>> functions currently throw an error if any of their arguments are BLOBs
>> because BLOBs are reserved for a future enhancement in which BLOBs
>> will store the binary encoding for JSON."
>>
>> But let me ask this:  Have you actually measured a performance
>> problem?  Or are you just assuming that because the implementation
>> parses the JSON anew each time it sees it, it must therefore be
>> inefficient?
>>
>>
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


Re: [sqlite] Memoization in sqlite json1 functions

2017-03-23 Thread Domingo Alvarez Duarte

Hello Richard !

Yes, I do see "because the implementation parses the JSON anew each 
time", and this is a recurring pattern in some sqlite functions. It would be 
nice if we could have a "session/query/row" storage space for query 
information that can be reused - for example, the json functions could 
reuse an already-parsed json field several times; another example: we could 
have session/query/row variables.


We could have something like "sqlite3_set_auxdata" but with row/query/session 
granularity, and as a bonus it would be nice to have session variables like 
https://dev.mysql.com/doc/refman/5.7/en/user-variables.html .


Cheers !

On 23/03/17 08:30, Richard Hipp wrote:

On 3/22/17, Domingo Alvarez Duarte  wrote:

Hello Richard !

I noticed that sqlite does not use any memoization in json1 functions.

For example jsonExtractFunc and others parse the json string every time
they're called, even when the json string is the same.

minimal example : "select json_extract(json, '$.name') name,
json_extract(json, '$.address') name from some_table;"

Could it be possible to have some kind of memoization as a general option
for any sqlite function?

In paragraph 3.0 of https://www.sqlite.org/json1.html:  "All json1
functions currently throw an error if any of their arguments are BLOBs
because BLOBs are reserved for a future enhancement in which BLOBs
will store the binary encoding for JSON."

But let me ask this:  Have you actually measured a performance
problem?  Or are you just assuming that because the implementation
parses the JSON anew each time it see it that it must therefore be
inefficient?





Re: [sqlite] Memoization in sqlite json1 functions

2017-03-23 Thread Richard Hipp
On 3/22/17, Domingo Alvarez Duarte  wrote:
> Hello Richard !
>
> I noticed that sqlite does not use any memoization in json1 functions.
>
> For example jsonExtractFunc and others parse the json string every time
> they're called, even when the json string is the same.
>
> minimal example : "select json_extract(json, '$.name') name,
> json_extract(json, '$.address') name from some_table;"
>
> Could it be possible to have some kind of memoization as a general option
> for any sqlite function?

In paragraph 3.0 of https://www.sqlite.org/json1.html:  "All json1
functions currently throw an error if any of their arguments are BLOBs
because BLOBs are reserved for a future enhancement in which BLOBs
will store the binary encoding for JSON."

But let me ask this:  Have you actually measured a performance
problem?  Or are you just assuming that because the implementation
parses the JSON anew each time it sees it, it must therefore be
inefficient?

-- 
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] Memoization in sqlite json1 functions

2017-03-23 Thread Deon Brewis
That would be nice.

I've resorted to a few horrible hacks like this:

SELECT parse(data), extract("name"), extract("address"), release(data) FROM 
some_table;

It works, but it relies on left-to-right evaluation of arguments (which holds 
today, but I seriously doubt is guaranteed), as well as on global/thread-local 
variables.

Though I can still live with that one - it works.
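For what it's worth, SQLite's sanctioned cache for this kind of thing is sqlite3_get_auxdata()/sqlite3_set_auxdata(), but auxdata is only preserved across calls for arguments that are compile-time constants - a JSON column value doesn't qualify, which is why hacks like the above come up. The shape of the memoization itself is simple; below is a toy pure-C++ sketch of it (parse_once and the "k=v;k=v" format are invented stand-ins for a real JSON parse, not json1 code):

```cpp
#include <map>
#include <string>

// Toy "parsed document": key -> value pairs split out of a "k=v;k=v" string.
// Stands in for the expensive JSON parse that memoization would avoid.
using Parsed = std::map<std::string, std::string>;

Parsed parse_once(const std::string& doc) {
    Parsed p;
    std::string::size_type pos = 0;
    while (pos < doc.size()) {
        auto eq = doc.find('=', pos);
        auto end = doc.find(';', pos);
        if (end == std::string::npos) end = doc.size();
        if (eq == std::string::npos || eq > end) break;
        p[doc.substr(pos, eq - pos)] = doc.substr(eq + 1, end - eq - 1);
        pos = end + 1;
    }
    return p;
}

// Memoizing extractor: re-parses only when the document text changes, so N
// extracts against the same row cost one parse instead of N.
class Extractor {
public:
    std::string extract(const std::string& doc, const std::string& key) {
        if (!valid_ || doc != lastDoc_) {   // cache miss: (re)parse
            cache_ = parse_once(doc);
            lastDoc_ = doc;
            valid_ = true;
        }
        auto it = cache_.find(key);
        return it == cache_.end() ? std::string() : it->second;
    }
private:
    std::string lastDoc_;
    Parsed cache_;
    bool valid_ = false;
};
```

A per-row cache like this is essentially what the parse()/extract()/release() hack builds by hand, minus the dependence on argument evaluation order.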

What I would like to see - is if you have an indexed expression like so:
create index some_index on some_table( json_extract(json, '$.name') );

And you run:
select json_extract(json, '$.name') from some_table indexed by some_index;

That it returns the resultant value that is already stored in the index, rather 
than re-executing the expression.
 
- Deon

-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Domingo Alvarez Duarte
Sent: Wednesday, March 22, 2017 5:04 PM
To: SQLite mailing list 
Subject: [sqlite] Memoization in sqlite json1 functions

Hello Richard !

I noticed that sqlite does not use any memoization in json1 functions.

For example jsonExtractFunc and others parse the json string every time they're 
called, even when the json string is the same.

minimal example : "select json_extract(json, '$.name') name, json_extract(json, 
'$.address') name from some_table;"

Could it be possible to have some kind of memoization as a general option for any 
sqlite function?

Cheers !

