Hey Mohamed,

Great to hear from you again, and sorry for taking a bit to reply. Excellent prototype; it's exactly what I had in mind. Very nice work. I checked it out, ran it, and it worked perfectly. Using Jackson for JSON handling makes perfect sense; it's also used heavily inside AsterixDB for a variety of things.

Those next steps sound great. You might find this document useful for the low-level details of the data model: https://cwiki.apache.org/confluence/display/ASTERIXDB/AsterixDB+Object+Serialization+Reference
ADM is basically an extension of JSON. Even that document is a bit out of date; we also support a 'geometry' type, which is a GeoJSON field. Some kind of integration test also sounds good. It's always a bit tricky coordinating things between two projects that depend on one another, so feel free to reach out with any questions or difficulties you come across.
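As a tiny illustration of that data-model wrinkle: when query results come back as plain JSON, ADM-specific types (datetime, point, geometry, ...) lose their type information, so a dump tool has to re-wrap them in constructor syntax when it regenerates statements. A rough sketch in Python (the field names and serialized forms here are made-up assumptions; in a real tool the mapping would come from the Datatype metadata, and the exact SQL++ grammar should be double-checked):

```python
import json

# Hypothetical mapping: which fields of a dataset carry ADM types that
# plain JSON cannot represent directly. In a real tool this would be
# derived from the Datatype metadata, not hard-coded.
ADM_CONSTRUCTORS = {
    "created": "datetime",  # ADM datetime, serialized as an ISO string in JSON
}

def to_sqlpp_literal(field, value):
    """Render one JSON value as a SQL++ literal, re-wrapping ADM types
    in their constructor functions (e.g. datetime("...")) where needed."""
    if field in ADM_CONSTRUCTORS:
        return f'{ADM_CONSTRUCTORS[field]}({json.dumps(value)})'
    return json.dumps(value)

def record_to_insert(dataverse, dataset, record):
    """Turn one record (a dict parsed from the query result) into an
    INSERT statement. Treat the syntax as a sketch, not the exact grammar."""
    fields = ", ".join(f'"{k}": {to_sqlpp_literal(k, v)}' for k, v in record.items())
    return f"INSERT INTO {dataverse}.{dataset} ([{{{fields}}}]);"

rec = {"id": 1, "name": "Ann", "created": "2026-03-27T06:34:00Z"}
print(record_to_insert("Default", "Users", rec))
# -> INSERT INTO Default.Users ([{"id": 1, "name": "Ann", "created": datetime("2026-03-27T06:34:00Z")}]);
```

The serialization reference linked above is the place to work out, per ADM type, what the JSON form looks like and what constructor to wrap it back in.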
To answer your two questions:

1. Sure, you can send the proposal draft to me and I'd be happy to give it a look.
2. I would defer to the GSoC guide on this, but I think it means 350 hours over the 12 weeks of the program. It doesn't have to be 8 hours every day, but it should average out to about that much.

I put this project down as "large" because even though it's conceptually straightforward, there is a lot of surface area that is important to get right. I think the code in the area of metadata management is also kind of convoluted and hard to read, and I didn't want there to be time pressure when there are a lot of sticky details. If it ends up being a bit easier than it seems, there are some adaptations or extensions to the project that could easily fill the rest of the time. For example, if the translator ends up working perfectly as an external tool, the next step could be integrating it into the main codebase as a datasource or a special function that returns everything. There are also many variants in how to represent the backups of datasets. The most straightforward, conceptually, is to emit an INSERT for each record. However, INSERT statements have poor performance in AsterixDB for a variety of reasons, so one improvement could be to dump each dataset as a JSONL file and then have the DDLs load it with COPY FROM or LOAD statements instead.

Best,
- Ian

On Fri, Mar 27, 2026 at 6:34 AM Mohamed Hossam <[email protected]> wrote:
>
> Hey everyone,
>
> I'm Mohamed Hossam, a recent CS graduate who's interested in database
> systems. I'm currently writing a proposal for the "Backup/restore utility
> for AsterixDB [ASTERIXDB-3697
> <https://issues.apache.org/jira/browse/ASTERIXDB-3697>]" project for GSoC
> 2026. I emailed the potential mentor for this project a couple of weeks
> ago, and he instructed me to run AsterixDB locally and investigate the code
> in asterixdb-metadata. So, I did as suggested by my mentor and I'm excited
> to share what I learnt.
>
> I managed to create a very minimal prototype that can query AsterixDB and
> generate basic CREATE and INSERT statements from its data. You can find my
> work at: [m0hossam/asterixdb-dump
> <https://github.com/m0hossam/asterixdb-dump/>]. I used FasterXML's Jackson
> JSON parser and tried to recreate the metadata objects from the parsed
> JSON. Of course, this is only a proof of concept, I'm deliberately ignoring
> complex statements just to get the prototype up and running to test
> the feasibility of the project. The actual implementation will require a
> deeper understanding of AsterixDB's metadata and catalog.
>
> My next steps would be:
>
> - Dive deeper into asterixdb-metadata and gain a better understanding of
> the data model.
> - Potentially contribute to AsterixDB if I find something to improve in
> the relevant code areas.
> - Write automated unit tests to compare queries from the original
> database with queries from the database generated by my prototype's dump,
> ensuring database integrity.
>
>
> I have two questions regarding the project:
>
> 1. Should I send my technical proposal to the potential mentor of the
> project for review before submitting it through the official website?
> 2. The project size is supposedly "~350 hour (large)". What does this
> mean in terms of time commitment? Will the project have an extended
> timeline? Or does the project require 8 hours of work per day during the 3
> months of coding?
>
>
> Best regards,
> Mohamed Hossam
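P.S. To make the JSONL variant from my reply a bit more concrete, here is a rough sketch (Python for brevity; the dataverse/dataset names are made up, and the localfs adapter and LOAD parameter spellings should be checked against the docs for your AsterixDB version):

```python
import json
import os
import tempfile

def dump_jsonl(records, path):
    """Write one JSON object per line (JSONL) -- the bulk format
    suggested instead of per-record INSERTs."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def load_statement(dataverse, dataset, path):
    """Emit a LOAD statement pointing at the dump file. The localfs
    adapter and parameter names here are assumptions to be verified
    against the current LOAD / COPY FROM documentation."""
    return (
        f'LOAD DATASET {dataverse}.{dataset} USING localfs\n'
        f'    (("path"="localhost://{path}"), ("format"="json"));'
    )

records = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
path = os.path.join(tempfile.gettempdir(), "users.jsonl")
dump_jsonl(records, path)
print(load_statement("Default", "Users", path))
```

The nice property of this shape is that the backup stays human-readable JSON, while the restore path goes through the bulk-load machinery rather than one INSERT per record.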
