Apoorv,
Sorry for responding late; I was occupied with some other work. Here’s my
response:
> Why do we need external-view, in addition to current-state and ideal-state?
Well, let’s take an example – say we have a resource (task), e.g. a database
resource. In Helix, you can partition a resource (parallelize the task) AND you
can replicate these partitions. Now let’s say we have a cluster with 3
participants (p1, p2, p3). In our example, let’s say we decide to partition
this resource into 2 partitions (db_0, db_1) and have 3 replicas (db_0_p1,
db_0_p2, db_0_p3, and similarly db_1_p*).
A current state is specific to a RESOURCE on a PARTICIPANT – {p1: db_0 =
MASTER, db_1 = SLAVE}. Whereas an external view is an aggregation of the
current state of a RESOURCE across all PARTICIPANTS – {db_0_p1 = MASTER,
db_0_p2 = SLAVE, db_0_p3 = SLAVE, …, db_1_p3 = SLAVE}.
If you notice, there is a subtle yet significant difference in the outputs.
Current state answers “what is the state of my resource X on participant Y?”,
whereas external view answers “what is the state of my resource X across all
participants?”. Why is this useful? The answer follows below.
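To make this concrete, here is a rough sketch (not Airavata code; the cluster
name, instance name, and ZooKeeper address are placeholders) of a client
reading the aggregated external view of the “db” resource through the Helix
data accessor:

import java.util.Map;
import org.apache.helix.HelixDataAccessor;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.PropertyKey;
import org.apache.helix.model.ExternalView;

public class ExternalViewReader {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster/instance/ZooKeeper values.
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
                "DEMO_CLUSTER", "viewer", InstanceType.SPECTATOR, "localhost:2181");
        manager.connect();

        HelixDataAccessor accessor = manager.getHelixDataAccessor();
        PropertyKey.Builder keyBuilder = accessor.keyBuilder();

        // The external view is one record per resource, aggregated across ALL participants.
        ExternalView view = accessor.getProperty(keyBuilder.externalView("db"));
        for (String partition : view.getPartitionSet()) {
            // e.g. db_0 -> {p1=MASTER, p2=SLAVE, p3=SLAVE}
            Map<String, String> stateMap = view.getStateMap(partition);
            System.out.println(partition + " -> " + stateMap);
        }

        manager.disconnect();
    }
}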
> What is the purpose of a helix spectator?
Helix spectators are nodes (or agents) that are “watching” a participant OR a
resource, which ideally means they are created/deployed/designed to REACT to
any changes in the state of these resources. For example, you can have a
spectator that does REQUEST ROUTING – i.e., depending on the incoming request,
it needs to forward that request to the appropriate resource on the appropriate
participant. Now, how does the spectator answer the question “where can I find
a resource partition in the MASTER state?” The answer is the EXTERNAL VIEW –
the spectator has aggregated information about the resource across all
participants. PS: This routing example might not be an ideal one, but it is
good for understanding the spectator’s role in the system.
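As an illustration (again a simplified sketch with placeholder values, not
production code), a routing spectator typically connects with
InstanceType.SPECTATOR and queries a RoutingTableProvider, which is kept up to
date from the external view:

import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.model.InstanceConfig;
import org.apache.helix.spectator.RoutingTableProvider;

public class RequestRouter {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster/instance/ZooKeeper values.
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
                "DEMO_CLUSTER", "router", InstanceType.SPECTATOR, "localhost:2181");
        manager.connect();

        // The routing table listens to external view changes and caches them in memory.
        RoutingTableProvider routingTable = new RoutingTableProvider();
        manager.addExternalViewChangeListener(routingTable);

        // "Where can I find the MASTER replica of partition db_0?"
        // (A real router asks this per incoming request, once the initial
        // external view callback has populated the table.)
        List<InstanceConfig> masters = routingTable.getInstances("db", "db_0", "MASTER");
        for (InstanceConfig instance : masters) {
            System.out.println("Route db_0 to " + instance.getHostName() + ":" + instance.getPort());
        }

        manager.disconnect();
    }
}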
> Why a spectator with limited access, rather than granting it full access?
Every role (participant, administrator, controller, spectator) has a meaning.
You would not want a spectator to control and make changes to the system;
that’s why we have the administrator. Similarly, controllers are coded to
execute state transitions whenever a participant’s state diverges from the
ideal state. A spectator is designed to be READ_ONLY (not literally).
Hope this answers your questions and clears your doubts.
Thanks and Regards,
Gourav Shenoy
From: "Shenoy, Gourav Ganesh" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, July 18, 2017 at 5:47 AM
To: "[email protected]" <[email protected]>
Subject: Re: Helix + Mailing System
Hi Apoorv,
Thanks for the detailed email, and for nicely outlining your goals for the
coming weeks. Let me answer your questions pertaining to Helix in some time
(currently I'm travelling).
Keep up the good work, cheers!
Thanks and Regards,
Gourav Shenoy
On Jul 17, 2017 11:31 AM, Apoorv Palkar <[email protected]> wrote:
Hey Dev,
For the past 3-3.5 weeks, I've been investigating the use of Helix in Airavata
and working on the email monitoring problem. I went through the
Curator/Zookeeper code to test out the internal workings of Helix. A particular
question I had was: what is the difference between external view and current
state? I understood that Helix uses the resource model to maintain both the
ideal state and the current state, so why is it necessary to have an external
view? In addition to this, what is the purpose of a spectator node? The
documentation states that a "spectator" reacts to changes in a distributed
system. Why give that particular node limited abilities when you can give it
full access? These questions may be highly important to consider when writing
the Helix paper for submission.
As for the mailing/monitoring system, I have decided to move forward with the
JavaMail API + IMAP implementation. I used the [email protected] (Gmail)
address as a basis for running my test code. For this particular use case, I
didn't use the Gmail API because it had limited capabilities in terms of
function/library use; I played around with it, but was unsuccessful in getting
it to work in a clean and efficient manner. As such, I decided to use the
JavaMail API provided via imported libraries. IMAP was chosen because it has
greater capabilities than POP3, and POP3 was inefficient when fetching the
emails.
In terms of reading the emails, the first challenge was to set up the code
correctly to read from Gmail. Previously, the issue was that the emails were
being read every time the read() function was called in the Inbox class, which
meant every message would be pulled even if only one email was unread. This
proved to be highly time-costly, as the scigap email address has 10,000+ emails
at any given time. I set up boolean flags for emails that were read and ones
that were unread; as a result, not all messages have to be pulled, only the
ones with a "false" flag need to be read. These messages are pulled and then
put into a Message[] array, which is then sorted using a lambda expression,
since JavaMail retrieves the most recent message last. After the messages are
put into the array and dealt with, they are marked as "read" to avoid reading
them again.
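Roughly, the unread-only fetch looks like this (a simplified sketch rather than
the exact code in the repo; the account and password are placeholders, and
Gmail typically needs an app password for IMAP):

import java.util.Properties;
import javax.mail.Flags;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Store;
import javax.mail.search.FlagTerm;

public class UnreadInboxReader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("mail.store.protocol", "imaps");

        Session session = Session.getInstance(props);
        Store store = session.getStore("imaps");
        // Placeholder credentials.
        store.connect("imap.gmail.com", "user@example.org", "app-password");

        Folder inbox = store.getFolder("INBOX");
        inbox.open(Folder.READ_WRITE);

        // Fetch only messages whose SEEN flag is false, instead of pulling the whole inbox.
        Message[] unread = inbox.search(new FlagTerm(new Flags(Flags.Flag.SEEN), false));

        for (Message msg : unread) {
            System.out.println(msg.getSentDate() + " : " + msg.getSubject());
            // Mark as read so the next poll skips it.
            msg.setFlag(Flags.Flag.SEEN, true);
        }

        inbox.close(false);
        store.close();
    }
}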
Currently, I'm working on improving the implementations of all four email
parsers. It is highly important to make sure these parsers run efficiently, as
many emails will be read. I didn't want to use regex, as it is slightly slower
than plain string operations, so for my demo code I have currently used string
operations to parse the subject title/content. In a professional
implementation, an array or the StringBuilder class should be used to improve
speed. Currently, I'm refactoring the PBS code to run a bit more optimally and
running test cases for the other two email types.
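For example, assuming a SLURM notification subject roughly of the form
"SLURM Job_id=... Name=... Ended, Run time ..., COMPLETED, ExitCode 0" (the
exact format may differ per site, so treat this as a hypothetical sample), the
string-operation parsing looks something like:

public class SlurmSubjectParser {
    public static void main(String[] args) {
        // Hypothetical sample subject line.
        String subject = "SLURM Job_id=1234567 Name=test-job Ended, Run time 00:05:23, COMPLETED, ExitCode 0";

        // Plain string operations instead of regex, as described above.
        String jobId = valueAfter(subject, "Job_id=");
        String name = valueAfter(subject, "Name=");
        String[] parts = subject.split(", ");
        String status = parts.length > 2 ? parts[2] : "UNKNOWN";

        System.out.println(jobId + " / " + name + " / " + status);
    }

    // Returns the token following the given key, up to the next space.
    private static String valueAfter(String subject, String key) {
        int start = subject.indexOf(key);
        if (start < 0) {
            return "UNKNOWN";
        }
        start += key.length();
        int end = subject.indexOf(' ', start);
        return end < 0 ? subject.substring(start) : subject.substring(start, end);
    }
}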
Below is a link for the Gmail implementation + SLURM interpreter. Basically,
the idea is to have 4 classes that handle each email type and parse the
messages from the Message[] array. The idea is then to take the COMMON data
collected, such as job_id, name, status, and time, and put it into a Thrift
data model file. Using this Thrift model, a Java Thrift object would be created
and sent over an AMQP message queue (RabbitMQ), to then potentially be used in
a MySQL/SQL database. As of now, the database part is not clear, but it would
most likely be a registry that needs to be updated via the Java JPA library/SQL
queries.
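As a rough sketch of that last step (the queue name, broker address, and
plain-string payload are placeholders; the real plan is to send a
Thrift-serialized object), publishing the parsed job data to RabbitMQ could
look like:

import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class JobStatusPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker address

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Hypothetical queue for parsed job-status messages.
        channel.queueDeclare("job.status.updates", true, false, false, null);

        // Stand-in payload; the real plan is a Thrift object carrying job_id, name, status, time.
        String payload = "job_id=1234567,name=test-job,status=COMPLETED,time=2017-07-17T11:31:00";
        channel.basicPublish("", "job.status.updates", null,
                payload.getBytes(StandardCharsets.UTF_8));

        channel.close();
        connection.close();
    }
}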
https://github.com/chessman179/gmailtestinged <<<<<<<<<<<<< code.
** big shout out to Marcus --