Hi Dan,

Thanks for the reply. 

On Tue, 2008-10-07 at 11:47 -0400, Dan Bron wrote:
> Alex Rufon wrote:
> >  A few years back, I've asked this same forum on how to
> >  convert strings to numbers and back again. 
> 
> I remember these discussions.  Oleg suggested  s2i =:  6 s: s:  [1], and in a 
> follow up a few years later, you responded that you'd tried the suggestion, 
> but it didn't meet your needs because of the memory constraints imposed by 
> your architecture [2].
My architecture has changed. Instead of having 1 J instance on a server,
I now run J instances on the client where I bring down the data. Still,
I can't use this solution because it uses symbols. The primary use of
this method is to work on MS-SQL data like this:
   SIZESH
+----------------+-------+
|Size_Code       |text   |
+----------------+-------+
|Size_Schema_code|text   |
+----------------+-------+
|Size_Code_Desc  |text   |
+----------------+-------+
|Size_Seq_No     |numeric|
+----------------+-------+
   SIZESD
+--+----+---+-+
|XS|G006|XS |1|
+--+----+---+-+
|S |G006|S  |2|
+--+----+---+-+
|M |G006|M  |3|
+--+----+---+-+
|L |G006|L  |4|
+--+----+---+-+
|XL|G006|XL |5|
+--+----+---+-+
   SIZESV
36 41 42 1
37 41 37 2
38 41 38 3
39 41 39 4
40 41 40 5
   
As you can see, SIZESH tells us the field names and data types of the
SIZES table. SIZESD is the actual data that came from MS-SQL, but since
we don't play around with boxes, the data is converted into a numeric
array. 
   {."1 SIZESD
+--+-+-+-+--+
|XS|S|M|L|XL|
+--+-+-+-+--+
   {."1 SIZESV
36 37 38 39 40
   LOOKUP {~ {."1 SIZESV
+--+-+-+-+--+
|XS|S|M|L|XL|
+--+-+-+-+--+
   
Basically, the conversion is needed so that we can manipulate the data as a
two-dimensional numeric matrix.
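
A minimal sketch of this kind of string-to-number mapping, for illustration
only. The verb names s2n and n2s are hypothetical, and LOOKUP starts empty
here instead of holding the preloaded entries shown in the quoted LOOKUP_z_
line, so the indices differ from those in SIZESV:

```j
NB. Hypothetical sketch, not the production code.
NB. LOOKUP is a list of unique boxed strings; a string's number is its index.
LOOKUP =: 0$a:

NB. s2n: boxed strings -> integers (adds unseen strings to LOOKUP first)
s2n =: 3 : 0
LOOKUP =: ~. LOOKUP , , y   NB. nub keeps first occurrences, so old indices stay stable
LOOKUP i. y                 NB. result has the same shape as y
)

NB. n2s: integers -> the original boxed strings
n2s =: 3 : 'y { LOOKUP'

codes =: s2n 'XS';'S';'M';'L';'XL'   NB. 0 1 2 3 4 with an empty starting LOOKUP
n2s codes                            NB. the original five boxed strings
```

Because the strings themselves stay in LOOKUP, the mapping is invertible,
and the working data can remain a plain numeric matrix.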
> 
> >  tricky part was going back again from number to the original string.
> 
> Yes.  If you only needed unique numbers from unique strings, the solution 
> would be simpler (  128!:3 'string'  is a fun one).

You know, I got a bit excited about 128!:3, but then I found out that
I can't get the original string back from it.
> 
> The problem is that if your domain is unbounded (i.e. the input strings are 
> arbitrary and there are no constraints on their content or length), and you 
> need a 1:1 mapping, then your range is unbounded too.  That is, the strings 
> are as efficient a representation as you're going to get.  
> 
> Now, if you're not using numbers to increase efficiency, and you require the 
> ability to (for example) stitch string-identifiers onto a homogeneous array 
> of numbers, then you could do something like:
> 
>          s2i  =:  (x:#a.) #. a.&i.
> 
>          s2i  'Alex Rufon'
>       308953381376021135519598
> 
>          s2i^:_1 s2i  'Alex Rufon'
>       Alex Rufon
> 
> Essentially, this interprets your strings as numbers in base 256.  But as I 
> said, this only works if you don't care how big the output numbers are 
> (i.e. if you don't need to limit your range).  You can't get a more efficient 
> representation this way.  In fact, the representation is worse (and the 
> mapping will be slow for large inputs):
> 
>          a  =:  'Alex Rufon'
>          b  =:  s2i a
>          
>          7!:5 ;:'a b'
>       64 128
> 
> So the short answer is, if you require an unbounded domain but a restricted 
> range, someone's going to have to store your input (to make the mapping 
> invertible).  AFAIK, there are only two ways to do that in J.  There's the  
> s:  method Oleg suggested (where J stores the input strings behind the 
> scenes), and  there's doing it yourself:
> 
> >  LOOKUP_z_=: '';'0';'1';'2';'3';'4';'5';'6';'7';'8';'9';' '
> 
> So if you don't like this, we need to know why, and what it is you do want.  
> You'll need to describe the:
> 
> (A)  Bounds on your domain.  Describe the strings:  are they limited in any 
> way?  Are they a fixed length?  Is there a maximum length?  Is there a 
> limited universe of characters from which they can be composed?  How many do 
> you process in a batch?   Is there a reason you can't use your database to 
> map them to integers outside of J (e.g. using the [presumably autogenerated 
> integral] primary key of the table they came from)?
The string data comes from MS-SQL, so there are no bounds until I run out
of memory. But that will never happen, because I designed the process to
work on the smallest possible set of data.

Most of the data are actually GUIDs and auto IDs automatically
generated by the database.
> 
> (B)  Constraints on your range.  Why do you need numbers?  Are you just 
> looking for a more efficient representation of your strings?  Do you need to 
> attach the string (identifiers) to other numbers?  Are you trying to avoid 
> boxing (if so, why)?  Must the numbers be positive integers?  Can they be 
> negative, float, rational, complex?  Is there a limit to how large they can 
> be?  How do you use them?  
I believe you're right when you asked whether I just need numbers as an
efficient representation of my strings. The numbers have no constraints
beyond the requirement that each string map to a single number, which may
be negative or floating point.

They are used basically so that we can work with both strings and numbers
as one numeric matrix, allowing for searching, filtering, and other
manipulation without working with boxes.

This was actually not my idea. It was part of the design requirement put
in by my boss who was an APL programmer.
> 
> (C)  Constraints on the mapping (aside from those on the range).  Are you 
> still using the architecture described in [2]?  How much memory can a single 
> J instance use before you start running into performance problems?  How long 
> should a mapping take?  Is there a time or memory limit it must not exceed?
Oh, as I've said above, I've changed the underlying architecture, from one
HUGE/POWERFUL application server running only J to doing the processing
on the client desktop. The 1GB limit still exists because the client
machines are still 32-bit, but a judicious review of the input requirements
enabled J to work on the smallest set of data possible. :)
> 
> (D)  Reasons for not liking a lookup array.  I presume your application is 
> not totally functional (in the sense that Jose's applications are totally 
> functional), and that the extra global noun isn't necessarily a wart.  You 
> brought this up in the context of optimizing your application, so do you find 
> that  LOOKUP_z_ i. y  is not fast or lean enough?  (If so, since  i. is 
> highly optimized, I don't know if you're going to find a more efficient 
> solution.)
It's not that I don't like the lookup array. It's just that under JPM, for
a process that runs for 1.069 seconds, searchLOOKUP accounts for 3% of the
time over 150 repetitions. I am just concerned about scaling the operation
to process larger purchase orders.
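
One rough way to see how the lookup scales is to time it on larger made-up
data. This is only a sketch under assumptions: TABLE and keys are
hypothetical stand-ins for LOOKUP and the real purchase order data, and no
measured results are claimed here.

```j
NB. Made-up stand-in data: 100000 distinct boxed strings and 150 random keys.
TABLE =: <@":"0 i. 100000
keys  =: TABLE {~ 150 ?@$ 100000

NB. One batched lookup: all keys in a single call to i.
batch  =: TABLE i. keys

NB. One lookup per key, closer to 150 separate calls.
single =: TABLE&i."0 keys

NB. Both forms should give the same indices.
batch -: single

NB. 6!:2 reports the seconds taken to execute a sentence.
6!:2 'TABLE i. keys'
6!:2 'TABLE&i."0 keys'
```

Comparing the two timings as the table grows gives a rough idea of whether
the 150 repetitions or the table size matters more.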
> 
> I hope this helps,
> 
> -Dan
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm

In the end, I really appreciate that you asked your questions and
pointed out concepts and ideas. When you said that i. is already highly
optimized and may be the best option available to me ... I felt
relieved. :)

Thanks again.

r/Alex

-- 
"The right questions are more important than the right answers to the
wrong questions."
-Dr. John Romagna